The Structure of PDF

Technically, PDF is a complex language. The specification is 400 pages long. If you don’t want to know the details, skip to the section Putting It Together: A High-Volume Invoicing System. If you do, it’d be a good idea to open one of the sample PDF files provided with this chapter; unlike most you will find on the Web, they are uncompressed and numbered in a sensible order. We’ve provided this brief roadmap because we feel the PDF format offers many benefits, and you might want to add your own extensions in the future.

The outer layer of the PDF format provides overall document structure, specifying pages, fonts used, and advanced features such as tables of contents, special effects, and so on. Each page is a separate object and contains a stream of page-marking operators, which are basically highly abbreviated PostScript commands. The snippet of PostScript you saw earlier would end up like this:

72 720 m
72 72 l
/F5 24 Tf 42 TL
80 720 Td
('Hello World') Tj
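If you'd like to see how operators like these might be produced from Python, here is a minimal sketch. The function names pdf_string and hello_page_ops are invented for illustration and the coordinates are simply copied from the listing above; it mirrors those five lines rather than building a complete content stream.

```python
def pdf_string(text):
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    # PDF literal strings are delimited by parentheses; any quote
    # characters inside them are treated as part of the text itself.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def hello_page_ops(text, font="F5", size=24, leading=42):
    """Return page-marking operators like those in the listing, one per line."""
    return [
        "72 720 m",                         # m: move to (72, 720)
        "72 72 l",                          # l: line to (72, 72)
        f"/{font} {size} Tf {leading} TL",  # Tf: select font and size; TL: set leading
        "80 720 Td",                        # Td: move the text position
        f"{pdf_string(text)} Tj",           # Tj: show the string
    ]


if __name__ == "__main__":
    print("\n".join(hello_page_ops("Hello World")))
```

Escaping in pdf_string matters because an unbalanced parenthesis in the text would otherwise end the string early.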

Unfortunately, this page-marking code, which can at least be decoded if you have time and know where to look, can be compressed into a binary form and is buried inside an outer layer that’s quite complex. The outer layer consists of a series of numbered objects (don’t you love that word?) including pages, outlines, clickable links, font resources, and many other elements. These are delimited by the keywords obj and endobj and numbered within the file. Here’s a Catalog object, which sits at the top of PDF’s object model:

1 0 obj << ...
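Because the objects are delimited by obj and endobj, an uncompressed file can be explored with a few lines of Python. The following is a rough sketch, not a real PDF parser: the name list_objects is hypothetical, the sample bytes are an invented fragment rather than one of the chapter's sample files, and the naive pattern match would be fooled by binary stream data that happens to contain the word endobj.

```python
import re

# "<number> <generation> obj ... endobj", matched non-greedily across lines.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)


def list_objects(data):
    """Return (object number, generation, body) for each obj...endobj pair."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(data)
    ]


if __name__ == "__main__":
    # Invented fragment for illustration only.
    sample = (
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    )
    for number, generation, body in list_objects(sample):
        print(number, generation, body.decode("latin-1"))
```

To try it on a real file, read the file in binary mode and pass the bytes to list_objects; with the uncompressed samples the bodies should be readable as plain text.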
