Chapter 4. Document Structure
In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical structure. We consider the trailer dictionary, document catalog, and page tree. We enumerate the required entries in each object. We then look at two common structures in PDF files: text strings and dates.
Figure 4-1 shows the logical structure of a typical document.
Trailer Dictionary
The trailer dictionary, which resides in the file's trailer rather than the main body of the file, is one of the first things to be processed when a program wants to read a PDF document. It contains entries allowing the cross-reference table, and thus the file's objects, to be read. Its important entries are summarized in Table 4-1.
|Key (* = required)|Value type|Description|
|/Size*|Integer|Total number of entries in the file's cross-reference table (usually equal to the number of objects in the file plus one).|
|/Root*|Indirect reference to dictionary|The document catalog.|
|/Info|Indirect reference to dictionary|The document's document information dictionary.|
|/ID|Array of two strings|Uniquely identifies the file within a workflow. The first string is set when the file is first created; the second is updated by workflow systems when they modify the file.|
Here’s an example trailer dictionary:
<< /Size 421 /Root 377 0 R /Info 375 0 R /ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>] >>
Once the trailer dictionary has been processed, we can go on to read the document information dictionary and the document catalog.
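As a rough illustration of what "processing" the trailer dictionary involves, the sketch below pulls the four entries out of the example trailer shown above using regular expressions. The helper name `parse_trailer` is hypothetical, and the approach assumes a flat, unnested dictionary like the example; a real reader would use a proper PDF object parser.

```python
import re

# The example trailer dictionary from the text, as raw text.
TRAILER = (
    "<< /Size 421 /Root 377 0 R /Info 375 0 R "
    "/ID [<75ff22189ceac848dfa2afec93deee03> "
    "<057928614d9711db835e000d937095a2>] >>"
)

def parse_trailer(text):
    """Extract /Size, /Root, /Info and /ID from a simple trailer dictionary.

    Hypothetical helper: it only handles a flat layout like the example.
    """
    entries = {}
    size = re.search(r"/Size\s+(\d+)", text)
    if size:
        entries["Size"] = int(size.group(1))
    # An indirect reference is "<object number> <generation number> R".
    for key in ("Root", "Info"):
        ref = re.search(r"/%s\s+(\d+)\s+(\d+)\s+R" % key, text)
        if ref:
            entries[key] = (int(ref.group(1)), int(ref.group(2)))
    # /ID is an array of two hexadecimal strings.
    ids = re.search(r"/ID\s*\[\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*\]", text)
    if ids:
        entries["ID"] = (bytes.fromhex(ids.group(1)), bytes.fromhex(ids.group(2)))
    return entries

trailer = parse_trailer(TRAILER)
print(trailer["Size"], trailer["Root"], trailer["Info"])
```

The `Root` and `Info` values are (object number, generation number) pairs, which is what a reader would then look up in the cross-reference table.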
Document Information Dictionary
The document information dictionary contains the creation and modification dates of the file, together with some simple metadata (not to be confused with the more comprehensive XMP metadata discussed in XML Metadata).
Document information dictionary entries are described in Table 4-2. A typical document information dictionary is given in Example 4-1.
|Key|Value type|Description|
|/Title|text string|The document's title. Note that this has nothing to do with any title displayed on the first page.|
|/Subject|text string|The subject of the document. Again, this is just metadata with no particular rules about content.|
|/Keywords|text string|Keywords associated with this document. No advice is given as to how to structure these.|
|/Author|text string|The name of the author of the document.|
|/CreationDate|date string|The date the document was created.|
|/ModDate|date string|The date the document was last modified.|
|/Creator|text string|The name of the program which originally created this document, if it started as another format (for example, "Microsoft Word").|
|/Producer|text string|The name of the program which converted this file to PDF, if it started as another format (for example, the format of a word processor).|
<< /ModDate (D:20060926213913+02'00') /CreationDate (D:20060926213913+02'00') /Title (catalogueproduit-UK.qxd) /Creator (QuarkXPress: pictwpstops filter 1.0) /Producer (Acrobat Distiller 6.0 for Macintosh) /Author (James Smith) >>
The date string format (for /CreationDate and /ModDate) is discussed in the section Dates. The text string format (which describes how different encodings can be used within the string type) is described in Text Strings.
Document Catalog
The document catalog is the root object of the main object graph, from which all other objects may be reached through indirect references. In Table 4-3, we list the document catalog dictionary entries which are required, and some of the many optional ones, so as to briefly introduce PDF topics we don't cover elsewhere in these pages.
|Key (* = required)|Value type|Description|
|/Type*|name|Must be /Catalog.|
|/Pages*|indirect reference to dictionary|The root node of the page tree. Page trees are discussed in Pages and Page Trees.|
|/PageLabels|number tree|A number tree giving the page labels for this document. This mechanism allows pages in a document to have more complicated numbering than just 1, 2, 3, and so on. For example, the preface of a book may be numbered i, ii, iii, whilst the main content starts again at 1, 2, 3. These page labels are displayed in PDF viewers; they have nothing to do with printed output.|
|/Names|dictionary|The name dictionary. This contains various name trees, which map names to entities, to avoid having to use object numbers to reference them directly.|
|/Dests|dictionary|A dictionary mapping names to destinations. A destination is a description of a place within a PDF document to which a hyperlink sends the user.|
|/ViewerPreferences|dictionary|A viewer preferences dictionary, which allows flags to specify the behavior of a PDF viewer when the document is viewed on screen, such as the page it is opened on, the initial viewing scale and so on.|
|/PageLayout|name|Specifies the page layout to be used by PDF viewers, for example /TwoColumnLeft.|
|/PageMode|name|Specifies the page mode to be used by PDF viewers.|
|/Outlines|indirect reference to dictionary|The outline dictionary is the root of the document outline, commonly known as the bookmarks.|
|/Metadata|indirect reference to stream|The document's XMP metadata; see XML Metadata.|
Pages and Page Trees
A page tree, built from page dictionaries, brings together instructions for drawing the graphical and textual content (which we consider in Chapter 5 and Chapter 6) with the resources (fonts, images, and other external data) which those instructions make use of. It also includes the page size, together with a number of other boxes defining cropping and so forth.
The entries in a page dictionary are summarized in Table 4-4.
|Key (* = required)|Value type|Description|
|/Type*|name|Must be /Page.|
|/Parent*|indirect reference to dictionary|The parent node of this node in the page tree.|
|/Resources|dictionary|The page's resources (fonts, images, and so on). If this entry is omitted entirely, the resources are inherited from the parent node in the page tree. If there are really no resources, include this entry but use an empty dictionary.|
|/Contents|indirect reference to stream or array of such references|The graphical content of the page in one or more sections. If this entry is missing, the page is empty.|
|/Rotate|integer|The viewing rotation of the page in degrees, clockwise from north. Value must be a multiple of 90. Default value: 0. This applies to both viewing and printing. If this entry is missing, its value is inherited from its parent node in the page tree.|
|/MediaBox*|rectangle|The page's media box (the size of its media, i.e., paper). For most purposes, the page size. If this entry is missing, it is inherited from its parent node in the page tree.|
|/CropBox|rectangle|The page's crop box. This defines the region of the page visible by default when a page is displayed or printed. If absent, its value is defined to be the same as the media box.|
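The inheritance rules described for the resources, rotation, and media box entries amount to walking up the parent links until the entry is found. The following is a minimal sketch of that lookup, assuming the file's objects have already been parsed into a Python mapping from object number to dictionary (the `objects` layout and the `inherited` helper are illustrative, not part of any PDF library).

```python
def inherited(objects, page_num, key, default=None):
    """Look up `key` on a page, following /Parent links when it is missing."""
    node = objects[page_num]
    while True:
        if key in node:
            return node[key]
        if "Parent" not in node:
            # Reached the root of the page tree without finding the entry.
            return default
        node = objects[node["Parent"]]

# Hypothetical parsed objects: object number -> dictionary.
# Here /Parent holds the parent's object number directly.
objects = {
    1: {"Type": "Pages", "Kids": [2], "Count": 1,
        "MediaBox": [0, 0, 500, 800], "Rotate": 90},
    2: {"Type": "Page", "Parent": 1, "Rotate": 0},
}

media_box = inherited(objects, 2, "MediaBox")        # taken from the parent
rotate = inherited(objects, 2, "Rotate", default=0)  # the page's own value wins
crop_box = objects[2].get("CropBox", media_box)      # defaults to the media box
print(media_box, rotate, crop_box)
```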
The rectangle data structure for the media box and the other boxes is an array of four numbers. These define the diagonally opposite corners of the rectangle—the first two elements of the array being the x and y coordinates of one corner, the latter two elements being those of the other. Normally, the lower-left and upper-right corners are given. So, for example:
/MediaBox [0 0 500 800] /CropBox [100 100 400 700]
defines a 500 by 800 point page with a crop box removing 100 points on each side of the page.
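The arithmetic behind this example can be checked with a few lines of code. The sketch below treats a rectangle as a list of four numbers and normalizes it first, since any two diagonally opposite corners may be given; the function names are illustrative only.

```python
def normalize(rect):
    """Return (llx, lly, urx, ury) whichever opposite corners were given."""
    x1, y1, x2, y2 = rect
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def size(rect):
    """Width and height of a rectangle, in points."""
    llx, lly, urx, ury = normalize(rect)
    return (urx - llx, ury - lly)

media_box = [0, 0, 500, 800]
crop_box = [100, 100, 400, 700]

print(size(media_box))  # (500, 800)
print(size(crop_box))   # (300, 600)

# Margins removed by the crop box: left, bottom, right, top.
m = normalize(media_box)
c = normalize(crop_box)
margins = (c[0] - m[0], c[1] - m[1], m[2] - c[2], m[3] - c[3])
print(margins)          # (100, 100, 100, 100)
```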
The pages are linked together using a page tree, rather than a simple array. This tree structure makes it faster to find a given page in a document with hundreds or thousands of pages. Good PDF applications build a balanced tree (one with the minimum height for the number of nodes). This ensures that a particular page can be located quickly. The nodes with no children are the pages themselves. An example page tree structure for seven pages is shown in Figure 4-2.
This would be written in PDF objects as shown in Example 4-2. The entries in an intermediate or root page tree node (i.e., not a page itself) are summarized in Table 4-5.
% Root node
1 0 obj
<< /Type /Pages /Kids [2 0 R 3 0 R 4 0 R] /Count 7 >>
endobj
% Intermediate node
2 0 obj
<< /Type /Pages /Kids [5 0 R 6 0 R 7 0 R] /Parent 1 0 R /Count 3 >>
endobj
% Intermediate node
3 0 obj
<< /Type /Pages /Kids [8 0 R 9 0 R 10 0 R] /Parent 1 0 R /Count 3 >>
endobj
% Page 7
4 0 obj
<< /Type /Page /Parent 1 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
% Page 1
5 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
% Page 2
6 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
% Page 3
7 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
% Page 4
8 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
% Page 5
9 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
% Page 6
10 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
|Key (* = required)|Value type|Description|
|/Type*|name|Must be /Pages.|
|/Kids*|array of indirect references|The immediate child page-tree nodes of this node.|
|/Count*|integer|The number of page nodes (not other page tree nodes) which are eventual children of this node.|
|/Parent|indirect reference to page tree node|Reference to the parent of this node (the node of which this is a child). Must be present if not the root node of the page tree.|
In this tree, any page can be found at most two indirect references away from the root node.
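To see why the /Count entries make lookup fast, here is a sketch of finding a page by its position in the document. It assumes the objects of Example 4-2 have been parsed into a Python mapping from object number to dictionary, with /Kids held as plain object numbers; that representation and the `find_page` helper are hypothetical. At each level, whole subtrees are skipped by subtracting their /Count rather than visiting every page.

```python
# Example 4-2 as object number -> dictionary (hypothetical in-memory form).
objects = {
    1: {"Type": "Pages", "Kids": [2, 3, 4], "Count": 7},
    2: {"Type": "Pages", "Kids": [5, 6, 7], "Parent": 1, "Count": 3},
    3: {"Type": "Pages", "Kids": [8, 9, 10], "Parent": 1, "Count": 3},
    4: {"Type": "Page", "Parent": 1},
    5: {"Type": "Page", "Parent": 2},
    6: {"Type": "Page", "Parent": 2},
    7: {"Type": "Page", "Parent": 2},
    8: {"Type": "Page", "Parent": 3},
    9: {"Type": "Page", "Parent": 3},
    10: {"Type": "Page", "Parent": 3},
}

def find_page(objects, root, index):
    """Return the object number of the page at 0-based position `index`."""
    if index < 0:
        raise IndexError("page index out of range")
    node_num = root
    while objects[node_num]["Type"] == "Pages":
        for kid in objects[node_num]["Kids"]:
            kid_obj = objects[kid]
            # A page counts as one; an intermediate node counts its /Count.
            count = kid_obj["Count"] if kid_obj["Type"] == "Pages" else 1
            if index < count:
                node_num = kid  # the page lies inside this child
                break
            index -= count      # skip the whole subtree
        else:
            raise IndexError("page index out of range")
    return node_num

print(find_page(objects, 1, 0))  # page 1 is object 5
print(find_page(objects, 1, 6))  # page 7 is object 4
```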
Text Strings
Strings outside of the actual textual content of a page (e.g., bookmark names and document information) are known as text strings. They are encoded using either PDFDocEncoding or (in more recent documents) Unicode. PDFDocEncoding is based on the ISO Latin-1 encoding. It is documented fully in Annex D of ISO Standard 32000-1:2008.
Text strings which are encoded as Unicode are distinguished by looking at the first two bytes: these will be 254 followed by 255. This is the Unicode byte-order marker U+FEFF, which indicates the UTF-16BE encoding. This means a PDFDocEncoding string can't begin with þ (254) followed by ÿ (255), but this is unlikely to occur in any reasonable circumstance.
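The byte-order-marker test can be sketched as follows. Python has no built-in PDFDocEncoding codec, so this example falls back to Latin-1 as an approximation; that substitution is an assumption made for brevity, and a faithful decoder would use the table in Annex D instead. The function name is illustrative.

```python
import codecs

def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string to a Python str."""
    # codecs.BOM_UTF16_BE is the two bytes 254, 255 (U+FEFF, big-endian).
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding here. The two
    # encodings differ for some byte values, so this is only approximate.
    return raw.decode("latin-1")

print(decode_text_string(b"\xfe\xff\x00H\x00i"))
print(decode_text_string(b"PDF Explained Example"))
```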
Dates
The creation and modification dates (/CreationDate and /ModDate) in the document information dictionary are examples of the PDF date format, which encodes a date in a string, including information about the time zone.
A date string has the format:
(D:YYYYMMDDHHmmSSOHH'mm')
where the parentheses indicate a string as usual. The other parts of the date are summarized in Table 4-6.
|Part|Meaning|
|YYYY|The year, in four digits, e.g., 2006.|
|MM|The month, in two digits from 01 to 12.|
|DD|The day, in two digits from 01 to 31.|
|HH|The hour, in two digits from 00 to 23.|
|mm|The minute, in two digits from 00 to 59.|
|SS|The second, in two digits from 00 to 59.|
|O|The relationship of local time to Universal Time, either + (local time is later), - (local time is earlier) or Z (local time is equal to Universal Time).|
|HH'|The absolute value of the offset from Universal Time in hours, in two digits from 00 to 23.|
|mm'|The absolute value of the offset from Universal Time in minutes, in two digits from 00 to 59.|
All parts of the date after the year are optional. For example, (D:1999) is perfectly valid. Plainly, though, if you omit one part, you must omit everything which follows, otherwise the result would be ambiguous. The default value for MM and DD is 01; for all other parts, the default is zero.
For example, (D:20060926213913+02'00') represents September 26th 2006 at 9:39:13 p.m., in a time zone two hours ahead of Universal Time.
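The optional parts and their defaults map naturally onto a regular expression with optional groups. The sketch below is one possible way to parse a date string (given without its enclosing parentheses) into a Python `datetime`; the name `parse_pdf_date` is illustrative, and returning a naive `datetime` when no time zone part is present is a choice made here rather than something the format dictates.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional; a part can only be present if
# all the parts before it are present, which the greedy groups enforce.
DATE_RE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)

def parse_pdf_date(s):
    """Parse the contents of a PDF date string such as D:20060926213913+02'00'."""
    m = DATE_RE.match(s)
    if not m:
        raise ValueError("not a PDF date string: %r" % s)
    year, month, day, hour, minute, second, rel, off_h, off_m = m.groups()
    tz = None  # assumption: no time zone part gives a naive datetime
    if rel == "Z":
        tz = timezone.utc
    elif rel in ("+", "-"):
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(offset if rel == "+" else -offset)
    # Defaults: 01 for month and day, zero for everything else.
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0),
                    tzinfo=tz)

print(parse_pdf_date("D:20060926213913+02'00'"))
print(parse_pdf_date("D:1999"))
```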
Putting it Together
The following is a manually created text, to be processed into a valid PDF file by pdftk using the method introduced in Chapter 2. It is a three-page document, with a document information dictionary and page tree. Figure 4-3 shows this document displayed in Acrobat Reader. Figure 4-4 is the corresponding object graph.
%PDF-1.1
% Header
% Top-level of page tree: has two children, page one and an intermediate page tree node
1 0 obj
<< /Kids [2 0 R 3 0 R] /Type /Pages /Count 3 >>
endobj
% Contents stream for page one
4 0 obj
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page One) Tj ET
endstream
endobj
% Page one
2 0 obj
<< /Rotate 0 /Parent 1 0 R /Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >> /MediaBox [0.000000 0.000000 595.275590551 841.88976378] /Type /Page /Contents [4 0 R] >>
endobj
% Document catalog
5 0 obj
<< /PageLayout /TwoColumnLeft /Pages 1 0 R /Type /Catalog >>
endobj
% Page three
6 0 obj
<< /Rotate 0 /Parent 3 0 R /Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >> /MediaBox [0.000000 0.000000 595.275590551 841.88976378] /Type /Page /Contents [7 0 R] >>
endobj
% Intermediate page tree node, linking to pages two and three
3 0 obj
<< /Parent 1 0 R /Kids [8 0 R 6 0 R] /Count 2 /Type /Pages >>
endobj
% Page two
8 0 obj
<< /Rotate 270 /Parent 3 0 R /Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >> /MediaBox [0.000000 0.000000 595.275590551 841.88976378] /Type /Page /Contents [9 0 R] >>
endobj
% Content stream for page two
9 0 obj
<< >>
stream
q 1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Two) Tj ET Q
1. 0.000000 0.000000 1. 50. 750 cm BT /F0 16 Tf ((Rotated by 270 degrees)) Tj ET
endstream
endobj
% Content stream for page three
7 0 obj
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Three) Tj ET
endstream
endobj
% Document information dictionary
10 0 obj
<< /Title (PDF Explained Example) /Author (John Whitington) /Producer (Manually Created) /ModDate (D:20110313002346Z) /CreationDate (D:2011) >>
endobj
xref
0 11
trailer
% Trailer dictionary
<< /Info 10 0 R /Root 5 0 R /Size 11 /ID [<75ff22189ceac848dfa2afec93deee03> <75ff22189ceac848dfa2afec93deee03>] >>
startxref
0
%%EOF