pdf editors

Paul Sladen sounder at paul.sladen.org
Tue Mar 15 20:25:21 UTC 2005


On Tue, 15 Mar 2005, Patrick Wagstrom wrote:

Hello Patrick,

> The compressed nature of PDF files really isn't the problem, they're
> just compressed using zip compression[1], nothing big.

Yes, individual streams in a PDF (for example, lists of drawing instructions
or image data) can be compressed.  Flate (deflate, aka gzip aka zlib)
algorithm support was introduced in PDF-1.2 (beforehand there was only LZW).
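Since Flate is just zlib's deflate, a couple of lines of Python show the whole round-trip — this is an illustrative sketch (the operator string is invented), not taken from a real file:

```python
import zlib

# A PDF content stream is plain drawing operators; with /Filter /FlateDecode
# the bytes between 'stream' and 'endstream' are simply zlib-compressed.
operators = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"

compressed = zlib.compress(operators)    # what ends up inside the PDF
restored = zlib.decompress(compressed)   # what a viewer sees after decoding

assert restored == operators
print(len(operators), "->", len(compressed), "bytes")
```

The same zlib call is all a tool like pdftk or Ghostscript needs in order to get at the drawing instructions.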

Outside the compressed streams, PDF files are just text documents.  You can
open them up in a text-editor and view them.  I find giving people the
following example file is the easiest way of showing what a PDF file is.

  http://www.paul.sladen.org/projects/pdfutils/minimal.pdf.txt
  http://www.paul.sladen.org/projects/pdfutils/minimal.pdf

You can open that file straight in a PDF viewer such as Xpdf, mupdf or
Ghostscript.  (Remove the .txt if it makes life easier!).

> The issue is that these file formats; PDF, PS, EPS, etc; are made not to
> be edited.

PostScript, as a fully-blown programming language, is certainly meant to be
read, parsed and executed as a continuous stream.  The situation with PDF is
a little more varied.

The PDF command set looks very similar to the PostScript drawing operators,
but reduced to single letters, for example 'l' for 'lineto' and 'm' for
'moveto'.  PDF also removes the conditional operators ('for', 'if'...),
leaving just the unrolled result in the render stream.
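A toy sketch of that correspondence (the operator table covers just a handful of real PDF operators; the sample stream is invented):

```python
# Illustrative mapping from a few single-letter PDF content-stream operators
# back to the longer PostScript names they abbreviate.  Note there is no
# 'for' or 'if' to map: conditionals never survive into a PDF stream.
PDF_TO_PS = {"m": "moveto", "l": "lineto", "c": "curveto",
             "S": "stroke", "f": "fill"}

stream = "72 72 m 144 72 l S"
translated = " ".join(PDF_TO_PS.get(tok, tok) for tok in stream.split())
print(translated)   # 72 72 moveto 144 72 lineto stroke
```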

One thing that was quickly discovered with PostScript is that to manipulate
the document, you don't actually need to parse or understand it.  If you're
rearranging the page order, all you need to know is where to start copying &
pasting and where to stop.  This markup (as special comments) is called the
Document Structuring Conventions (DSC).
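The "copy & paste without understanding" trick can be sketched in a few lines — this is a simplification (real DSC has more comments, like %%Pages and %%EndComments), but it shows the idea of splitting on markers alone:

```python
# Sketch: split a DSC-conforming PostScript file into per-page chunks by
# scanning the %%Page: comments, without interpreting any PostScript at all.
def split_pages(ps_text):
    pages, current = [], None
    for line in ps_text.splitlines(keepends=True):
        if line.startswith("%%Page:"):
            if current is not None:
                pages.append("".join(current))
            current = [line]                  # start copying here...
        elif line.startswith("%%Trailer") and current is not None:
            pages.append("".join(current))    # ...and stop copying here
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        pages.append("".join(current))
    return pages

doc = "%!PS-Adobe-3.0\n%%Page: 1 1\n...draw...\n%%Page: 2 2\n...draw...\n%%Trailer\n"
print(len(split_pages(doc)))   # 2
```

Rearranging the page order is then just reordering the chunks before writing them back out.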

PDF provides this facility straight out of the box;  one of the advantages
it has over the raw, continuous stream of PostScript is a clear object
hierarchy from the outset:

  /Root -> /Catalog -> /Pages -> /Page -> /Content stream
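Modelling that hierarchy with plain Python dicts makes the fixed path visible — purely illustrative object numbers and keys, not a real parser:

```python
# The PDF object graph as dicts: each key is an object number, and the
# trailer's /Root points at the top of the /Catalog -> /Pages -> /Page chain.
objects = {
    1: {"Type": "Catalog", "Pages": 2},             # /Root -> /Catalog
    2: {"Type": "Pages", "Kids": [3], "Count": 1},  # the page tree
    3: {"Type": "Page", "Contents": 4},             # a single page
    4: {"stream": "BT (Hello) Tj ET"},              # its render stream
}
trailer = {"Root": 1}

catalog = objects[trailer["Root"]]
page = objects[objects[catalog["Pages"]]["Kids"][0]]
content = objects[page["Contents"]]["stream"]
print(content)   # BT (Hello) Tj ET
```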

Everything else is also bagged up neatly in additional objects, including
Fonts and Image data.  Each object has not only a unique ID, but also a
generation number, allowing updates within the same file while preserving
any digitally-signed content already in the file.

Instead of overwriting a PDF each time, it's possible to append to the end
(actually somewhat harder than just rewriting, in practice...).  The reason
this is possible is the same reason that PDF files are instant-access
(however big they are) and fairly efficient to search.

At the end of the PDF is an 'xref' cross-reference table (plain ASCII in the
classic format) full of pointers to the starting offset of each object in
the file---once you know which page you need, it's possible to seek straight
to the objects in question, without having to start executing from the
beginning each time, as is the case with unstructured PostScript.
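The instant-access trick hinges on one number at the very end of the file; a sketch of how a reader finds it (the tail bytes here are made up, with a dummy offset):

```python
# Sketch: the last few bytes of every classic PDF name the byte offset of
# the xref table.  A reader seeks there first, then straight to any object.
tail = b"startxref\n1234\n%%EOF\n"        # how every classic PDF ends

def xref_offset(data):
    idx = data.rindex(b"startxref")       # search backwards from the end
    return int(data[idx:].split()[1])     # the offset to seek() to

print(xref_offset(tail))   # 1234
```

This is also why incremental update works: append new objects, a new xref and a new startxref, and old readers simply use whichever startxref comes last.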

> They're designed so I can put out a piece of work and ensure that it
> stays the same on everyones computer.

This is perhaps key to why PDF is so successful in what it does.  PDF
emphasises Presentation over the Logical Structure of the Document (TeX does
the opposite), ensuring that, as you say, it will look the same
everywhere.

One of the reasons this is possible is that PDF requires that everything
needed to display the document (including any non-standard Fonts used) be
embedded within the file[0].  This avoids the problem, seen so frequently,
of one user sending another a word-processing document, only to find it
looks completely different because one font has had to be substituted for
another.

  [0] Technically, Font Metrics are required and the glyph data recommended.

However close the approximation of the second font, the characters will
always be a different shape with differing metrics and kerning.  Just like
the Butterfly and the Tornado story, one tiny wrapping change at the top of
the document can cause a /completely/ different result by the end!

> There is no concept of "word wrap" inside of a PDF document.

Something that changed this, and made true PDF editing possible, was the
introduction of Tagged/Structured PDF in the 1.4 specification.

This is additional markup, somewhat like HTML, to say ''..this is a word'',
''..this is a heading'', ''..this is a paragraph''.  There is still not much
information in the file about the template aspects, like the width of the
column that text should be reflowed into;  but that information can be
recovered to a certain degree, either by looking for private Tags used by
the application that originally created the document---or by inference!

Analysing the document in the same way that an OCR (Optical Character
Recognition) program does, it's possible to make good guesses about which
pieces of content are related to each other.  If you see several
text-drawing commands with the same starting left or right edge, spaced at
an equal distance in the other direction, you can guess that it might be a
column of text:

      |... ...... ..
      |.. ...  ... ...
      |.... ... ...
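That guess can be sketched as a clustering pass over the text-drawing commands — all coordinates and strings below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical text-draw commands as (x, y, text): infer a column by
# grouping on a shared left edge and checking for equal vertical spacing.
draws = [(72, 700, "Lorem ipsum"), (72, 686, "dolor sit"),
         (72, 672, "amet, consect"), (300, 700, "A side note")]

columns = defaultdict(list)
for x, y, text in draws:
    columns[x].append((y, text))              # same starting left edge

for x, lines in sorted(columns.items()):
    gaps = [a[0] - b[0] for a, b in zip(lines, lines[1:])]
    print(f"x={x}: {len(lines)} line(s), equal leading: {len(set(gaps)) <= 1}")
```

Three lines sharing x=72 with a constant 14-unit leading is a strong hint of one column; the lone draw at x=300 is something else.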

The same facility is used for extracting text from a PDF for pasting, and
for reflowing 'eBooks' onto small-screen devices like the Palm Pilot.  You
are, however, reflowing it using your /own/ algorithm and not the one used
to lay out the text originally, so some decisions will end up being
different, since there's more-than-one-way-to-break-a-word!

> Also, one of the issues with PS files is that they're a series of
> instructions that tell the renderer/printer how to move around.  Usually
> positions are given relative to the previous.

PostScript, like TeX with its macro-expansion, is designed to be parsed in a
linear fashion, with everything affecting what follows it.  What EPS
(Encapsulated PostScript) introduces is a limit on the use of state-altering
commands, and a requirement that you leave the drawing state as you found it.

This restriction on preserving state is the reason that EPS programs can be
inserted into a word-processing document, even if they can't be displayed on
the screen.  When printing the whole file, the Word Processor knows it can
safely stitch in a section of the EPS at the appropriate point.
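The "leave the state as you found it" rule is essentially a save/restore bracket around the inserted fragment (PostScript's gsave/grestore); a sketch of the idea in Python, with an invented graphics state:

```python
import copy
from contextlib import contextmanager

# A stand-in for the interpreter's graphics state (colour, line width, ...).
state = {"colour": "black", "linewidth": 1.0}

@contextmanager
def preserved(s):
    saved = copy.deepcopy(s)       # gsave: snapshot the state
    try:
        yield s
    finally:
        s.clear()
        s.update(saved)            # grestore: put everything back

with preserved(state) as s:
    s["colour"] = "red"            # the EPS may scribble on the state...
print(state["colour"])             # ...but the host document sees "black"
```

As long as every insertion is bracketed like this, the host can stitch fragments in blindly, which is exactly the guarantee EPS demands.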

PDF is a little saner in this area; ...to a point.  The limit of damage (the
amount that needs to be redrawn or recalculated) is contained to an extent:
primarily at the render-stream level or, if the display library is more
intelligent, at the lower level of individual drawing commands.

It's possible to generate a 'scene-graph': using the mandatory Bounding-
Boxes, you can calculate a good hierarchy of what is going to touch or
overlap, and where to 're-run' the glyph/render-stream from.
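The touch-or-overlap test itself is cheap — for axis-aligned bounding boxes it's one line (coordinates below invented for illustration):

```python
# Two axis-aligned boxes (x0, y0, x1, y1) overlap iff they overlap on
# both axes; this is the whole test behind the damage-limiting hierarchy.
def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

glyph = (100, 100, 120, 115)
edit_region = (110, 90, 300, 200)
print(overlaps(glyph, edit_region))   # True: this glyph must be re-run
```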

Of course, ...at the same time we got Tagged Markup in PDF-1.4, we also got
Alpha Transparency and Porter-Duff compositing, leading to a whole different
and exciting rendering model!  :-)

	-Paul

(Who's been toying around with writing a GPL'ed PDF editor for the last
couple of years.  Nope, big project and nothing for you to play with yet).
-- 
I didn't know it snowed here!  London, GB





More information about the ubuntu-users mailing list