PDF Explained

Chapter 4. Document Structure

In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical structure. We consider the trailer dictionary, document catalog, and page tree. We enumerate the required entries in each object. We then look at two common structures in PDF files: text strings and dates.

Figure 4-1 shows the logical structure of a typical document.

Figure 4-1. Typical document structure for a two-page PDF document

Trailer Dictionary

This dictionary, residing in the file’s trailer rather than the main body of the file, is one of the first things to be processed when a program wants to read a PDF document. It contains entries allowing the cross-reference table—and thus the file’s objects—to be read. Its important entries are summarized in Table 4-1.

Table 4-1. Entries in a trailer dictionary (*denotes required entry)

Key	Value type	Value
`/Size`*	Integer	Total number of entries in the file’s cross-reference table (usually equal to the number of objects in the file plus one).
`/Root`*	Indirect reference to dictionary	The document catalog.
`/Info`	Indirect reference to dictionary	The document’s document information dictionary.
`/ID`	Array of two strings	Uniquely identifies the file within a workflow. The first string is set when the file is first created; the second is updated by workflow systems when they modify the file.

Here’s an example trailer dictionary:

<<
   /Size 421
   /Root 377 0 R
   /Info 375 0 R
   /ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>

Once the trailer dictionary has been processed, we can go on to read the document information dictionary and the document catalog.
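
As a rough illustration of this first step, here is a minimal Python sketch which picks the `/Size`, `/Root`, and `/Info` entries out of a trailer. The function name `read_trailer_entries` is hypothetical, and the sketch assumes a classic `trailer << ... >>` section with no nested dictionaries; a real reader would use a proper PDF tokenizer rather than regular expressions.

```python
import re


def read_trailer_entries(pdf_bytes: bytes) -> dict:
    """Return /Size, /Root and /Info from the last trailer dictionary.

    Assumes a classic 'trailer << ... >>' section with no nested
    dictionaries. This is an illustration, not a full PDF parser.
    """
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        raise ValueError("no trailer keyword found")
    end = pdf_bytes.find(b">>", start)
    if end == -1:
        raise ValueError("trailer dictionary is not closed")
    body = pdf_bytes[start:end].decode("latin-1")

    entries = {}
    size = re.search(r"/Size\s+(\d+)", body)
    if size:
        entries["Size"] = int(size.group(1))
    # Indirect references have the form "<object number> <generation> R".
    for key in ("Root", "Info"):
        ref = re.search(rf"/{key}\s+(\d+)\s+(\d+)\s+R", body)
        if ref:
            entries[key] = (int(ref.group(1)), int(ref.group(2)))
    return entries


example = b"""trailer
<<
   /Size 421
   /Root 377 0 R
   /Info 375 0 R
>>
startxref
0
%%EOF"""

print(read_trailer_entries(example))
# {'Size': 421, 'Root': (377, 0), 'Info': (375, 0)}
```

The two references returned for `Root` and `Info` are what a reader would then look up in the cross-reference table to find the document catalog and the document information dictionary.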

Document Information Dictionary

The document information dictionary contains the creation and modification dates of the file, together with some simple metadata (not to be confused with the more comprehensive XMP metadata discussed in XML Metadata).

Document information dictionary entries are described in Table 4-2. A typical document information dictionary is given in Example 4-1.

Table 4-2. Entries in a document information dictionary. The types “text string” and “date string” are explained later in this chapter.

Key	Value type	Value
`/Title`	text string	The document’s title. Note that this has nothing to do with any title displayed on the first page.
`/Subject`	text string	The subject of the document. Again, this is just metadata with no particular rules about content.
`/Keywords`	text string	Keywords associated with this document. No advice is given as to how to structure these.
`/Author`	text string	The name of the author of the document.
`/CreationDate`	date string	The date the document was created.
`/ModDate`	date string	The date the document was last modified.
`/Creator`	text string	The name of the program which originally created this document, if it started as another format (for example, “Microsoft Word”).
`/Producer`	text string	The name of the program which converted this file to PDF, if it started as another format (for example, the format of a word processor).

Example 4-1. Typical document information dictionary

<<
   /ModDate (D:20060926213913+02'00')
   /CreationDate (D:20060926213913+02'00')
   /Title (catalogueproduit-UK.qxd)
   /Creator (QuarkXPress: pictwpstops filter 1.0)
   /Producer (Acrobat Distiller 6.0 for Macintosh)
   /Author (James Smith)
>>

The date string format (for /CreationDate and /ModDate) is discussed in the section Dates. The text string format (which describes how different encodings can be used within the string type) is described in Text Strings.

Document Catalog

The document catalog is the root object of the main object graph, from which all other objects may be reached through indirect references. In Table 4-3, we list the required document catalog dictionary entries and some of the many optional ones, briefly introducing PDF topics we don’t cover elsewhere in these pages.

Table 4-3. The document catalog (*denotes required entry)

Key	Value type	Value
`/Type`*	name	Must be `/Catalog`.
`/Pages`*	indirect reference to dictionary	The root node of the page tree. Page trees are discussed in Pages and Page Trees.
`/PageLabels`	number tree	A number tree giving the page labels for this document. This mechanism allows for pages in a document to have more complicated numbering than just 1, 2, 3, and so on. For example, the preface of a book may be numbered i, ii, iii, and so on, whilst the main content starts again at 1, 2, 3. These page labels are displayed in PDF viewers; they have nothing to do with printed output.
`/Names`	dictionary	The name dictionary. This contains various name trees, which map names to entities, to prevent having to use object numbers to reference them directly.
`/Dests`	dictionary	A dictionary mapping names to destinations. A destination is a description of a place within a PDF document to which a hyperlink sends the user.
`/ViewerPreferences`	dictionary	A viewer preferences dictionary, which allows flags to specify the behavior of a PDF viewer when the document is viewed on screen, such as whether the toolbar and menu bar are hidden or whether the window is resized to fit the first page.
`/PageLayout`	name	Specifies the page layout to be used by PDF viewers. Values are `/SinglePage`, `/OneColumn`, `/TwoColumnLeft`, `/TwoColumnRight`, `/TwoPageLeft`, `/TwoPageRight`. (Default: `/SinglePage`). Details are in Table 28 of ISO 32000-1:2008.
`/PageMode`	name	Specifies the page mode to be used by PDF viewers. Values are `/UseNone`, `/UseOutlines`, `/UseThumbs`, `/FullScreen`, `/UseOC`, `/UseAttachments`. (Default: `/UseNone`). Details are in Table 28 of ISO 32000-1:2008.
`/Outlines`	indirect reference to dictionary	The outline dictionary is the root of the document outline, commonly known as the bookmarks.
`/Metadata`	indirect reference to stream	The document’s XMP metadata—see XML Metadata.

Pages and Page Trees

A page dictionary brings together instructions for drawing the graphical and textual content (which we consider in Chapter 5 and Chapter 6) with the resources (fonts, images, and other external data) which those instructions make use of. It also includes the page size, together with a number of other boxes defining cropping and so forth. The page dictionaries are linked together in a page tree.

The entries in a page dictionary are summarized in Table 4-4.

Table 4-4. Entries in a page dictionary (*denotes required entry)

Key	Value type	Value
`/Type`*	name	Must be `/Page`.
`/Parent`*	indirect reference to dictionary	The parent node of this node in the page tree.
`/Resources`	dictionary	The page’s resources (fonts, images, and so on). If this entry is omitted entirely, the resources are inherited from the parent node in the page tree. If there are really no resources, include this entry but use an empty dictionary.
`/Contents`	indirect reference to stream or array of such references	The graphical content of the page in one or more sections. If this entry is missing, the page is empty.
`/Rotate`	integer	The viewing rotation of the page in degrees, clockwise from north. Value must be a multiple of 90. Default value: 0. This applies to both viewing and printing. If this entry is missing, its value is inherited from its parent node in the page tree.
`/MediaBox`*	rectangle	The page’s media box (the size of its media, i.e., paper). For most purposes, the page size. If this entry is missing, it is inherited from its parent node in the page tree.
`/CropBox`	rectangle	The page’s crop box. This defines the region of the page visible by default when a page is displayed or printed. If absent, its value is defined to be the same as the media box.
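
The inheritance rule described for `/Resources`, `/Rotate`, and `/MediaBox` in Table 4-4 can be sketched in a few lines of Python. This is only an illustration: it assumes the objects have already been read into Python dictionaries and that each `/Parent` entry holds the parent dictionary directly rather than an indirect reference. The helper name `inherited` is hypothetical.

```python
def inherited(node, key, default=None):
    """Look up `key` on a page, walking up /Parent links if it is missing."""
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("/Parent")  # None once we are past the root node
    return default


# A root page tree node carrying values for its descendants to inherit.
root = {"/Type": "/Pages", "/MediaBox": [0, 0, 500, 800], "/Rotate": 90}
# A page which has its own (empty) resources but no media box or rotation.
page = {"/Type": "/Page", "/Parent": root, "/Resources": {}}

print(inherited(page, "/MediaBox"))   # [0, 0, 500, 800]
print(inherited(page, "/Rotate", 0))  # 90
print(inherited(page, "/Resources"))  # {}
```

Note how the empty `/Resources` dictionary on the page is found first and stops the search, which is why an empty dictionary and a missing entry mean different things.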

The rectangle data structure for the media box and the other boxes is an array of four numbers. These define the diagonally opposite corners of the rectangle—the first two elements of the array being the x and y coordinates of one corner, the latter two elements being those of the other. Normally, the lower-left and upper-right corners are given. So, for example:

/MediaBox [0 0 500 800]
/CropBox [100 100 400 700]

defines a 500 by 800 point page with a crop box removing 100 points on each side of the page.
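
The arithmetic behind this example can be checked with a short Python sketch. The function names are hypothetical. Because the two corners may in principle be given in either order, the sketch first normalizes the rectangle to lower-left and upper-right form.

```python
def normalize_rect(rect):
    """Return (llx, lly, urx, ury) whichever two opposite corners were given."""
    x1, y1, x2, y2 = rect
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def rect_size(rect):
    """Width and height of a rectangle, in points."""
    llx, lly, urx, ury = normalize_rect(rect)
    return (urx - llx, ury - lly)


def crop_margins(media_box, crop_box):
    """Points removed by the crop box on the (left, bottom, right, top)."""
    mllx, mlly, murx, mury = normalize_rect(media_box)
    cllx, clly, curx, cury = normalize_rect(crop_box)
    return (cllx - mllx, clly - mlly, murx - curx, mury - cury)


media_box = [0, 0, 500, 800]
crop_box = [100, 100, 400, 700]

print(rect_size(media_box))               # (500, 800)
print(crop_margins(media_box, crop_box))  # (100, 100, 100, 100)
```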

The pages are linked together using a page tree, rather than a simple array. This tree structure makes it faster to find a given page in a document with hundreds or thousands of pages. Good PDF applications build a balanced tree (one with the minimum height for the number of nodes). This ensures that a particular page can be located quickly. The nodes with no children are the pages themselves. An example page tree structure for seven pages is shown in Figure 4-2.

This would be written in PDF objects as shown in Example 4-2. The entries in an intermediate or root page tree node (i.e., not a page itself) are summarized in Table 4-5.

Figure 4-2. A page tree for seven pages. The exact shape of the tree is left to the individual PDF application. The PDF code for this tree is shown in Example 4-2.

Example 4-2. PDF objects used to build the page tree illustrated in Figure 4-2

1 0 obj % Root node
<< /Type /Pages /Kids [2 0 R 3 0 R 4 0 R] /Count 7 >>
endobj
2 0 obj % Intermediate node
<< /Type /Pages /Kids [5 0 R 6 0 R 7 0 R] /Parent 1 0 R /Count 3 >>
endobj
3 0 obj % Intermediate node
<< /Type /Pages /Kids [8 0 R 9 0 R 10 0 R] /Parent 1 0 R /Count 3 >>
endobj
4 0 obj % Page 7
<< /Type /Page /Parent 1 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
5 0 obj % Page 1
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
6 0 obj % Page 2
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
7 0 obj % Page 3
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
8 0 obj % Page 4
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
9 0 obj % Page 5
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
10 0 obj % Page 6
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj

Table 4-5. Entries in an intermediate or root page tree node (*denotes a required entry)

Key	Value type	Value
`/Type`*	name	Must be `/Pages`.
`/Kids`*	array of indirect references	The immediate child page-tree nodes of this node.
`/Count`*	integer	The number of page nodes (not other page tree nodes) which are eventual children of this node.
`/Parent`	indirect reference to page tree node	Reference to the parent of this node (the node of which this is a child). Must be present if not the root node of the page tree.

In this tree, any page can be found at most two indirect references away from the root node.
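
To see how the `/Count` entries let a reader skip whole subtrees, here is a minimal Python sketch which finds a page by its position in the tree from Example 4-2. It is an illustration under simplifying assumptions: the objects are modelled as a Python dictionary keyed by object number, `/Kids` holds plain object numbers instead of indirect references, and the function name `find_page` is hypothetical.

```python
# The page tree of Example 4-2, keyed by object number.
objects = {
    1: {"Type": "Pages", "Kids": [2, 3, 4], "Count": 7},
    2: {"Type": "Pages", "Kids": [5, 6, 7], "Count": 3},
    3: {"Type": "Pages", "Kids": [8, 9, 10], "Count": 3},
}
for number in range(4, 11):
    objects[number] = {"Type": "Page"}


def find_page(objects, root, index):
    """Return the object number of the page at zero-based position `index`."""
    node_number = root
    node = objects[node_number]
    if not 0 <= index < node["Count"]:
        raise IndexError("page index out of range")
    while node["Type"] == "Pages":
        for kid_number in node["Kids"]:
            kid = objects[kid_number]
            # A page counts as one; a subtree counts as its /Count pages.
            kid_count = kid["Count"] if kid["Type"] == "Pages" else 1
            if index < kid_count:
                node_number, node = kid_number, kid
                break
            index -= kid_count  # skip this whole subtree
        else:
            raise ValueError("/Count does not match /Kids")
    return node_number


print([find_page(objects, 1, i) for i in range(7)])
# [5, 6, 7, 8, 9, 10, 4]
```

The result matches the annotations in Example 4-2: pages one to six are objects 5 to 10, and page seven is object 4.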

Text Strings

Strings outside of the actual textual content of a page (e.g., bookmark names and document information) are known as text strings. They are encoded using either PDFDocEncoding or (in more recent documents) Unicode. PDFDocEncoding is based on the ISO Latin-1 encoding. It is documented fully in Annex D of ISO 32000-1:2008.

Text strings which are encoded as Unicode are distinguished by looking at the first two bytes: these will be 254 followed by 255. This is the Unicode byte order mark U+FEFF, which indicates the UTF-16BE encoding. This means a PDFDocEncoding string can’t begin with þ (254) followed by ÿ (255), but this is unlikely to occur in any reasonable circumstance.
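
The check on the first two bytes can be sketched in Python as follows. The function name `decode_text_string` is hypothetical, and the sketch makes one loud simplification: it uses Python's Latin-1 codec as a stand-in for PDFDocEncoding, which differs from Latin-1 at some code points (see Annex D of the standard for the full table).

```python
def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string into a Python string."""
    # Bytes 254, 255 are the byte order mark which signals UTF-16BE.
    if raw[:2] == b"\xfe\xff":
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding. A faithful
    # implementation would use the mapping table from Annex D instead.
    return raw.decode("latin-1")


print(decode_text_string(b"James Smith"))          # James Smith
print(decode_text_string(b"\xfe\xff\x00H\x00i"))   # Hi
```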

Dates

The creation and modification dates /CreationDate and /ModDate in the document information dictionary are examples of the PDF date format, which encodes a date in a string, including information about the time zone.

A date string has the format:

(D:YYYYMMDDHHmmSSOHH'mm')

where the parentheses indicate a string as usual. The other parts of the date are summarized in Table 4-6.

Table 4-6. PDF date format constituents

Portion	Meaning
`YYYY`	The year, in four digits, e.g., `2008`.
`MM`	The month, in two digits from `01` to `12`.
`DD`	The day, in two digits from `01` to `31`.
`HH`	The hour, in two digits from `00` to `23`.
`mm`	The minute, in two digits from `00` to `59`.
`SS`	The second, in two digits from `00` to `59`.
`O`	The relationship of local time to Universal Time, either `+`, `-` or `Z`. `+` signifies local time is later than UT, `-` earlier, and `Z` equal to Universal Time.
`HH'`	The absolute value of the offset from Universal Time in hours, in two digits from `00` to `23`.
`mm'`	The absolute value of the offset from Universal Time in minutes, in two digits from `00` to `59`.

All parts of the date after the year are optional. For example, (D:1999) is perfectly valid. Plainly, though, if you omit one part, you must omit everything which follows, otherwise the result would be ambiguous. The default value for MM and DD is 01; for all other parts, the default is zero.

For example:

(D:20060926213913+02'00')

represents September 26th 2006 at 9:39:13 p.m., in a time zone two hours ahead of Universal Time.
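
The date format and its defaults can be expressed as a short Python sketch. The function name `parse_pdf_date` is hypothetical, and it takes the contents of the string without the surrounding parentheses. One design choice is our own assumption: when no time zone is given, the sketch returns a Python datetime with no time zone attached rather than guessing one.

```python
import re
from datetime import datetime, timedelta, timezone

# Every part after the year is optional, matched strictly left to right.
_DATE = re.compile(
    r"D:(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<off_h>\d{2})'(?:(?P<off_m>\d{2})'?)?)?)?$"
)


def parse_pdf_date(text: str) -> datetime:
    """Parse a PDF date string such as D:20060926213913+02'00'."""
    m = _DATE.match(text)
    if m is None:
        raise ValueError(f"not a PDF date string: {text!r}")

    def num(name, default):
        value = m.group(name)
        return int(value) if value is not None else default

    tz = None  # assumption: no zone information means a naive datetime
    sign = m.group("sign")
    if sign is not None:
        offset = timedelta(hours=num("off_h", 0), minutes=num("off_m", 0))
        tz = timezone(-offset if sign == "-" else offset)

    # Month and day default to 1; the time fields default to 0.
    return datetime(num("year", 0), num("month", 1), num("day", 1),
                    num("hour", 0), num("minute", 0), num("second", 0),
                    tzinfo=tz)


print(parse_pdf_date("D:20060926213913+02'00'").isoformat())
# 2006-09-26T21:39:13+02:00
print(parse_pdf_date("D:1999").isoformat())
# 1999-01-01T00:00:00
```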

Putting it Together

Example 4-3 is a manually created text file, to be processed into a valid PDF file by pdftk using the method introduced in Chapter 2. It is a three-page document with a document information dictionary and a page tree. Figure 4-3 shows this document displayed in Acrobat Reader. Figure 4-4 is the corresponding object graph.

Example 4-3. A three-page document with document information dictionary

%PDF-1.1 Header
1 0 obj % Top level of page tree: has two children, page one and an intermediate page tree node
<< /Kids [2 0 R 3 0 R] /Type /Pages /Count 3 >>
endobj 
4 0 obj % Content stream for page one
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page One) Tj ET 
endstream 
endobj 
2 0 obj % Page one
<<
  /Rotate 0
  /Parent 1 0 R
  /Resources 
    << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
  /MediaBox [0.000000 0.000000 595.275590551 841.88976378]
  /Type /Page
  /Contents [4 0 R]
>>
endobj 
5 0 obj % Document catalog
<< /PageLayout /TwoColumnLeft /Pages 1 0 R /Type /Catalog >>
endobj 
6 0 obj % Page three
<<
  /Rotate 0
  /Parent 3 0 R
  /Resources 
    << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
  /MediaBox [0.000000 0.000000 595.275590551 841.88976378]
  /Type /Page
  /Contents [7 0 R]
>>
endobj 
3 0 obj % Intermediate page tree node, linking to pages two and three
<< /Parent 1 0 R /Kids [8 0 R 6 0 R] /Count 2 /Type /Pages >>
endobj 
8 0 obj % Page two
<<
  /Rotate 270
  /Parent 3 0 R
  /Resources 
    << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
  /MediaBox [0.000000 0.000000 595.275590551 841.88976378]
  /Type /Page
  /Contents [9 0 R]
>>
endobj 
9 0 obj % Content stream for page two
<< >>
stream
q 1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Two) Tj ET Q
1. 0.000000 0.000000 1. 50. 750 cm BT /F0 16 Tf ((Rotated by 270 degrees)) Tj ET
endstream 
endobj 
7 0 obj % Content stream for page three
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Three) Tj ET 
endstream 
endobj 
10 0 obj % Document information dictionary
<<
  /Title (PDF Explained Example)
  /Author (John Whitington)
  /Producer (Manually Created)
  /ModDate (D:20110313002346Z)
  /CreationDate (D:2011)
>>
endobj
xref
0 11
trailer % Trailer dictionary
<<
  /Info 10 0 R
  /Root 5 0 R
  /Size 11
  /ID [<75ff22189ceac848dfa2afec93deee03> <75ff22189ceac848dfa2afec93deee03>]
>>
startxref
0
%%EOF