PDF Explained by John Whitington

Chapter 4. Document Structure

In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical structure. We consider the trailer dictionary, document catalog, and page tree. We enumerate the required entries in each object. We then look at two common structures in PDF files: text strings and dates.

Figure 4-1 shows the logical structure of a typical document.

Figure 4-1. Typical document structure for a two page PDF document

Trailer Dictionary

This dictionary, residing in the file’s trailer rather than the main body of the file, is one of the first things to be processed when a program wants to read a PDF document. It contains entries allowing the cross-reference table—and thus the file’s objects—to be read. Its important entries are summarized in Table 4-1.

Table 4-1. Entries in a trailer dictionary (*denotes required entry)
Key | Value type | Value
/Size* | Integer | Total number of entries in the file’s cross-reference table (usually equal to the number of objects in the file plus one).
/Root* | Indirect reference to dictionary | The document catalog.
/Info | Indirect reference to dictionary | The document’s document information dictionary.
/ID | Array of two strings | Uniquely identifies the file within a workflow. The first string is decided when the file is first created; the second is modified by workflow systems when they modify the file.

Here’s an example trailer dictionary:

<<
   /Size 421
   /Root 377 0 R
   /Info 375 0 R
   /ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>

Once the trailer dictionary has been processed, we can go on to read the document information dictionary and the document catalog.
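
As a rough illustration of that first step, the following Python sketch pulls the entries of Table 4-1 out of the example trailer dictionary above. The helper name read_trailer_entries is hypothetical, and the regular expressions are an assumption that only holds for a flat, uncompressed trailer like this one; a real reader would use a proper PDF object parser.

```python
import re

# The example trailer dictionary from the text, as raw bytes.
TRAILER = b"""<<
   /Size 421
   /Root 377 0 R
   /Info 375 0 R
   /ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>"""


def read_trailer_entries(data: bytes) -> dict:
    """Extract /Size, /Root, /Info and /ID from a simple trailer dictionary.

    Hypothetical helper: pattern matching is an assumption that works for
    flat trailers like the example, not for arbitrary PDF syntax.
    """
    entries = {}
    size = re.search(rb"/Size\s+(\d+)", data)
    if size:
        entries["Size"] = int(size.group(1))
    for key in (b"Root", b"Info"):
        ref = re.search(rb"/" + key + rb"\s+(\d+)\s+(\d+)\s+R", data)
        if ref:
            # An indirect reference is (object number, generation number).
            entries[key.decode("ascii")] = (int(ref.group(1)), int(ref.group(2)))
    ids = re.search(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]", data)
    if ids:
        # Hexadecimal strings decode to raw bytes.
        entries["ID"] = tuple(
            bytes.fromhex(group.decode("ascii")) for group in ids.groups()
        )
    return entries


entries = read_trailer_entries(TRAILER)
print(entries["Size"], entries["Root"], entries["Info"])  # 421 (377, 0) (375, 0)
```

The two references returned for /Root and /Info are what a reader would then look up in the cross-reference table to find the document catalog and the document information dictionary.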

Document Information Dictionary

The document information dictionary contains the creation and modification dates of the file, together with some simple metadata (not to be confused with the more comprehensive XMP metadata discussed in XML Metadata).

Document information dictionary entries are described in Table 4-2. A typical document information dictionary is given in Example 4-1.

Table 4-2. Entries in a document information dictionary. The types “text string” and “date string” are explained later in this chapter.
Key | Value type | Value
/Title | text string | The document’s title. Note that this has nothing to do with any title displayed on the first page.
/Subject | text string | The subject of the document. Again, this is just metadata with no particular rules about content.
/Keywords | text string | Keywords associated with this document. No advice is given as to how to structure these.
/Author | text string | The name of the author of the document.
/CreationDate | date string | The date the document was created.
/ModDate | date string | The date the document was last modified.
/Creator | text string | The name of the program which originally created this document, if it started as another format (for example, Microsoft Word).
/Producer | text string | The name of the program which converted this file to PDF, if it started as another format (for example, the format of a word processor).
Example 4-1. Typical document information dictionary
<<
   /ModDate (D:20060926213913+02'00')
   /CreationDate (D:20060926213913+02'00')
   /Title (catalogueproduit-UK.qxd)
   /Creator (QuarkXPress: pictwpstops filter 1.0)
   /Producer (Acrobat Distiller 6.0 for Macintosh)
   /Author (James Smith)
>>

The date string format (for /CreationDate and /ModDate) is discussed in the section Dates. The text string format (which describes how different encodings can be used within the string type) is described in Text Strings.

Document Catalog

The document catalog is the root object of the main object graph, from which all other objects may be reached through indirect references. In Table 4-3, we list the required document catalog dictionary entries, together with some of the many optional ones, to introduce briefly some PDF topics we don’t cover elsewhere in these pages.

Table 4-3. The document catalog (*denotes required entry)
Key | Value type | Value
/Type* | name | Must be /Catalog.
/Pages* | indirect reference to dictionary | The root node of the page tree. Page trees are discussed in Pages and Page Trees.
/PageLabels | number tree | A number tree giving the page labels for this document. This mechanism allows for pages in a document to have more complicated numbering than just 1, 2, 3, and so on. For example, the preface of a book may be numbered i, ii, iii, whilst the main content starts again at 1, 2, 3. These page labels are displayed in PDF viewers; they have nothing to do with printed output.
/Names | dictionary | The name dictionary. This contains various name trees, which map names to entities, to prevent having to use object numbers to reference them directly.
/Dests | dictionary | A dictionary mapping names to destinations. A destination is a description of a place within a PDF document to which a hyperlink sends the user.
/ViewerPreferences | dictionary | A viewer preferences dictionary, which allows flags to specify the behavior of a PDF viewer when the document is viewed on screen, such as the page it is opened on, the initial viewing scale and so on.
/PageLayout | name | Specifies the page layout to be used by PDF viewers. Values are /SinglePage, /OneColumn, /TwoColumnLeft, /TwoColumnRight, /TwoPageLeft, /TwoPageRight. (Default: /SinglePage). Details are in Table 28 of ISO 32000-1:2008.
/PageMode | name | Specifies the page mode to be used by PDF viewers. Values are /UseNone, /UseOutlines, /UseThumbs, /FullScreen, /UseOC, /UseAttachments. (Default: /UseNone). Details are in Table 28 of ISO 32000-1:2008.
/Outlines | indirect reference to dictionary | The outline dictionary is the root of the document outline, commonly known as the bookmarks.
/Metadata | indirect reference to stream | The document’s XMP metadata; see XML Metadata.

Pages and Page Trees

A page tree, built from page dictionaries, brings together instructions for drawing the graphical and textual content (which we consider in Chapter 5 and Chapter 6) with the resources (fonts, images, and other external data) which those instructions make use of. It also includes the page size, together with a number of other boxes defining cropping and so forth.

The entries in a page dictionary are summarized in Table 4-4.

Table 4-4. Entries in a page dictionary (*denotes required entry)
Key | Value type | Value
/Type* | name | Must be /Page.
/Parent* | indirect reference to dictionary | The parent node of this node in the page tree.
/Resources | dictionary | The page’s resources (fonts, images, and so on). If this entry is omitted entirely, the resources are inherited from the parent node in the page tree. If there are really no resources, include this entry but use an empty dictionary.
/Contents | indirect reference to stream or array of such references | The graphical content of the page in one or more sections. If this entry is missing, the page is empty.
/Rotate | integer | The viewing rotation of the page in degrees, clockwise from north. Value must be a multiple of 90. Default value: 0. This applies to both viewing and printing. If this entry is missing, its value is inherited from its parent node in the page tree.
/MediaBox* | rectangle | The page’s media box (the size of its media, i.e., paper). For most purposes, the page size. If this entry is missing, it is inherited from its parent node in the page tree.
/CropBox | rectangle | The page’s crop box. This defines the region of the page visible by default when a page is displayed or printed. If absent, its value is defined to be the same as the media box.

The rectangle data structure for the media box and the other boxes is an array of four numbers. These define the diagonally opposite corners of the rectangle—the first two elements of the array being the x and y coordinates of one corner, the latter two elements being those of the other. Normally, the lower-left and upper-right corners are given. So, for example:

/MediaBox [0 0 500 800]
/CropBox [100 100 400 700]

defines a 500 by 800 point page with a crop box removing 100 points on each side of the page.
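
The arithmetic behind that statement can be checked with a small Python sketch. The function names below are hypothetical; the only assumption is that a rectangle arrives as a list of four numbers, with the two corners given in either order.

```python
def normalize_rect(rect):
    """Return (llx, lly, urx, ury) whichever two opposite corners were given."""
    x1, y1, x2, y2 = rect
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def rect_size(rect):
    """Width and height of a rectangle, in points."""
    llx, lly, urx, ury = normalize_rect(rect)
    return (urx - llx, ury - lly)


def crop_margins(media_box, crop_box):
    """Points removed by the crop box on the left, bottom, right and top."""
    m = normalize_rect(media_box)
    c = normalize_rect(crop_box)
    return (c[0] - m[0], c[1] - m[1], m[2] - c[2], m[3] - c[3])


media_box = [0, 0, 500, 800]
crop_box = [100, 100, 400, 700]

print(rect_size(media_box))               # (500, 800)
print(rect_size(crop_box))                # (300, 600)
print(crop_margins(media_box, crop_box))  # (100, 100, 100, 100)
```

Normalizing first means a box written as [500 800 0 0] gives the same size, since the corners need not be listed lower-left first.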

The pages are linked together using a page tree, rather than a simple array. This tree structure makes it faster to find a given page in a document with hundreds or thousands of pages. Good PDF applications build a balanced tree (one with the minimum height for the number of nodes). This ensures that a particular page can be located quickly. The nodes with no children are the pages themselves. An example page tree structure for seven pages is shown in Figure 4-2.

This would be written in PDF objects as shown in Example 4-2. The entries in an intermediate or root page tree node (i.e., not a page itself) are summarized in Table 4-5.

Figure 4-2. A page tree for seven pages. The exact shape of the tree is left to the individual PDF application. The PDF code for this tree is shown in Example 4-2.
Example 4-2. PDF objects used to build the page tree illustrated in Figure 4-2
1 0 obj % Root node
<< /Type /Pages /Kids [2 0 R 3 0 R 4 0 R] /Count 7 >>
endobj
2 0 obj % Intermediate node
<< /Type /Pages /Kids [5 0 R 6 0 R 7 0 R] /Parent 1 0 R /Count 3 >>
endobj
3 0 obj % Intermediate node
<< /Type /Pages /Kids [8 0 R 9 0 R 10 0 R] /Parent 1 0 R /Count 3 >>
endobj
4 0 obj % Page 7
<< /Type /Page /Parent 1 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
5 0 obj % Page 1
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
6 0 obj % Page 2
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
7 0 obj % Page 3
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
8 0 obj % Page 4
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
9 0 obj % Page 5
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
10 0 obj % Page 6
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
Table 4-5. Entries in an intermediate or root page tree node (*denotes a required entry)
Key | Value type | Value
/Type* | name | Must be /Pages.
/Kids* | array of indirect references | The immediate child page-tree nodes of this node.
/Count* | integer | The number of page nodes (not other page tree nodes) which are eventual children of this node.
/Parent | indirect reference to page tree node | Reference to the parent of this node (the node of which this is a child). Must be present if not the root node of the page tree.

In this tree, any page can be found at most two indirect references away from the root node.
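
To make the role of /Kids and /Count concrete, here is a minimal Python sketch that models the objects of Example 4-2 as dictionaries. Representing an indirect reference as a plain object number, and the function names pages_in_order and find_page, are assumptions made for illustration; this is not how any particular PDF library stores the tree.

```python
# Objects from Example 4-2, keyed by object number. Indirect references
# are modelled as plain integers (an assumption made for brevity).
OBJECTS = {
    1: {"Type": "Pages", "Kids": [2, 3, 4], "Count": 7},
    2: {"Type": "Pages", "Kids": [5, 6, 7], "Parent": 1, "Count": 3},
    3: {"Type": "Pages", "Kids": [8, 9, 10], "Parent": 1, "Count": 3},
    4: {"Type": "Page", "Parent": 1},
    5: {"Type": "Page", "Parent": 2},
    6: {"Type": "Page", "Parent": 2},
    7: {"Type": "Page", "Parent": 2},
    8: {"Type": "Page", "Parent": 3},
    9: {"Type": "Page", "Parent": 3},
    10: {"Type": "Page", "Parent": 3},
}


def pages_in_order(objects, node_num):
    """Depth-first walk returning page object numbers in document order."""
    node = objects[node_num]
    if node["Type"] == "Page":
        return [node_num]
    pages = []
    for kid in node["Kids"]:
        pages.extend(pages_in_order(objects, kid))
    return pages


def find_page(objects, root_num, index):
    """Find the page with the given 0-based index, using /Count to skip
    whole subtrees instead of visiting every page."""
    if index < 0:
        raise IndexError("page index out of range")
    node_num = root_num
    while objects[node_num]["Type"] == "Pages":
        for kid in objects[node_num]["Kids"]:
            kid_node = objects[kid]
            span = kid_node["Count"] if kid_node["Type"] == "Pages" else 1
            if index < span:
                node_num = kid  # the page lies inside this child
                break
            index -= span  # skip every page under this child
        else:
            raise IndexError("page index out of range")
    return node_num


print(pages_in_order(OBJECTS, 1))  # [5, 6, 7, 8, 9, 10, 4]
print(find_page(OBJECTS, 1, 6))    # 4, the object holding page 7
```

Looking up page 7 skips both intermediate nodes on the strength of their /Count values alone, which is why a balanced tree keeps lookups quick in long documents.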

Text Strings

Strings outside of the actual textual content of a page (e.g., bookmark names and document information) are known as text strings. They are encoded using either PDFDocEncoding or (in more recent documents) Unicode. PDFDocEncoding is based on the ISO Latin-1 encoding. It is documented fully in Annex D of ISO Standard 32000-1:2008.

Text strings which are encoded as Unicode are distinguished by looking at the first two bytes: these will be 254 followed by 255. This is the Unicode byte-order marker U+FEFF, which indicates the UTF-16BE encoding. This means a PDFDocEncoding string can’t begin with þ (254) followed by ÿ (255), but this is unlikely to occur in any reasonable circumstance.
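
The byte-order-marker test can be sketched in a few lines of Python. The function name decode_text_string is hypothetical. Note the loud assumption in the fallback branch: Python has no built-in PDFDocEncoding codec, so Latin-1 stands in for it here, which is only an approximation because PDFDocEncoding assigns some byte values differently.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string into a Python string."""
    if raw[:2] == b"\xfe\xff":
        # Bytes 254, 255: the marker for UTF-16BE. Drop it, decode the rest.
        return raw[2:].decode("utf-16-be")
    # ASSUMPTION: Latin-1 is used as a stand-in for PDFDocEncoding.
    # It agrees for ordinary ASCII text but not for every byte value.
    return raw.decode("latin-1")


plain = decode_text_string(b"PDF Explained Example")
unicode_title = decode_text_string(b"\xfe\xff" + "Café".encode("utf-16-be"))

print(plain)          # PDF Explained Example
print(unicode_title)  # Café
```

A complete implementation would replace the fallback with a lookup table built from Annex D of the standard.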

Dates

The creation and modification dates /CreationDate and /ModDate in the document information dictionary are examples of the PDF date format, which encodes a date in a string, including information about the time zone.

A date string has the format:

(D:YYYYMMDDHHmmSSOHH'mm')

where the parentheses indicate a string as usual. The other parts of the date are summarized in Table 4-6.

Table 4-6. PDF date format constituents
Portion | Meaning
YYYY | The year, in four digits, e.g., 2008.
MM | The month, in two digits from 01 to 12.
DD | The day, in two digits from 01 to 31.
HH | The hour, in two digits from 00 to 23.
mm | The minute, in two digits from 00 to 59.
SS | The second, in two digits from 00 to 59.
O | The relationship of local time to Universal Time, either +, - or Z. + signifies local time is later than UT, - earlier, and Z equal to Universal Time.
HH' | The absolute value of the offset from Universal Time in hours, in two digits from 00 to 23.
mm' | The absolute value of the offset from Universal Time in minutes, in two digits from 00 to 59.

All parts of the date after the year are optional. For example, (D:1999) is perfectly valid. Plainly, though, if you omit one part, you must omit everything which follows, otherwise the result would be ambiguous. The default value for MM and DD is 01; for all other parts, the default is zero.

For example:

(D:20060926213913+02'00')

represents September 26th, 2006 at 9:39:13 p.m., in a time zone two hours ahead of Universal Time.
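
The format and its defaults translate fairly directly into a parser. The Python sketch below is one possible reading of the rules above, not a reference implementation: the name parse_pdf_date is hypothetical, the input is assumed to be the string content without the surrounding parentheses, and returning a datetime with no time zone when the O part is absent is a design choice of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# Every part after the year is optional, and each part may appear only
# if all the parts before it are present.
DATE_RE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(text: str) -> datetime:
    """Parse the content of a PDF date string, e.g. D:20060926213913+02'00'."""
    match = DATE_RE.match(text)
    if match is None:
        raise ValueError(f"not a PDF date string: {text!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    if sign is None:
        tz = None  # no time zone stated (a choice made by this sketch)
    else:
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(
        int(year),
        int(month or 1),   # MM defaults to 01
        int(day or 1),     # DD defaults to 01
        int(hour or 0),    # the remaining parts default to zero
        int(minute or 0),
        int(second or 0),
        tzinfo=tz,
    )


print(parse_pdf_date("D:20060926213913+02'00'").isoformat())
# 2006-09-26T21:39:13+02:00
print(parse_pdf_date("D:1999").isoformat())
# 1999-01-01T00:00:00
```

The datetime constructor also rejects impossible values such as month 13, so malformed dates fail loudly rather than being silently accepted.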

Putting it Together

Example 4-3 is a manually created text file, to be processed into a valid PDF file by pdftk using the method introduced in Chapter 2. It is a three-page document, with a document information dictionary and a page tree. Figure 4-3 shows this document displayed in Acrobat Reader. Figure 4-4 is the corresponding object graph.

Example 4-3. A three page document with document information dictionary
%PDF-1.1
% Header (line above)
1 0 obj % Top-level of page tree: has two children: page one and an intermediate page tree node
<< /Kids [2 0 R 3 0 R] /Type /Pages /Count 3 >>
endobj
4 0 obj % Contents stream for page one
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page One) Tj ET
endstream
endobj
2 0 obj % Page one
<<
  /Rotate 0
  /Parent 1 0 R
  /Resources
    << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
  /MediaBox [0.000000 0.000000 595.275590551 841.88976378]
  /Type /Page
  /Contents [4 0 R]
>>
endobj
5 0 obj % Document catalog
<< /PageLayout /TwoColumnLeft /Pages 1 0 R /Type /Catalog >>
endobj
6 0 obj % Page three
<<
  /Rotate 0
  /Parent 3 0 R
  /Resources
    << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
  /MediaBox [0.000000 0.000000 595.275590551 841.88976378]
  /Type /Page
  /Contents [7 0 R]
>>
endobj
3 0 obj % Intermediate page tree node, linking to pages two and three
<< /Parent 1 0 R /Kids [8 0 R 6 0 R] /Count 2 /Type /Pages >>
endobj
8 0 obj % Page two
<<
  /Rotate 270
  /Parent 3 0 R
  /Resources
    << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
  /MediaBox [0.000000 0.000000 595.275590551 841.88976378]
  /Type /Page
  /Contents [9 0 R]
>>
endobj
9 0 obj % Content stream for page two
<< >>
stream
q 1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Two) Tj ET Q
1. 0.000000 0.000000 1. 50. 750 cm BT /F0 16 Tf ((Rotated by 270 degrees)) Tj ET
endstream
endobj
7 0 obj % Content stream for page three
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Three) Tj ET
endstream
endobj
10 0 obj % Document information dictionary
<<
  /Title (PDF Explained Example)
  /Author (John Whitington)
  /Producer (Manually Created)
  /ModDate (D:20110313002346Z)
  /CreationDate (D:2011)
>>
endobj
xref
0 11
trailer % Trailer dictionary
<<
  /Info 10 0 R
  /Root 5 0 R
  /Size 11
  /ID [<75ff22189ceac848dfa2afec93deee03> <75ff22189ceac848dfa2afec93deee03>]
>>
startxref
0
%%EOF
Figure 4-3. Example 4-3 converted to a valid PDF with pdftk and displayed in Acrobat Reader
Figure 4-4. Object graph for Example 4-3
