Chapter 3. File Structure

In this chapter, we describe the layout and content of the PDF file’s four main sections and the syntax of the objects that make up each one. We also outline the process of reading a PDF file into a high-level data structure, and the converse operation of writing that structure to a PDF file.

File Layout

A simple valid PDF file has four parts, in order:

  1. The header, which gives the PDF version number.

  2. The body, containing the pages, graphical content, and much of the ancillary information, all encoded as a series of objects.

  3. The cross-reference table, which lists the position of each object within the file, to facilitate random access.

  4. The trailer, including the trailer dictionary, which helps to locate each part of the file and lists various pieces of metadata that can be read without processing the whole file.
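To make the four-part layout concrete, here is a minimal Python sketch that finds where each part begins in a simple file. The function name describe_layout and the keys of the dictionary it returns are our own, chosen for illustration. The sketch assumes a single-revision file whose lines end in a line feed (or carriage return plus line feed), and it scans naively for keywords at the start of a line, so stream data that happens to look like an object header could mislead it. A real reader would work backward from the end of the file instead.

```python
import re


def describe_layout(data: bytes) -> dict:
    """Locate the four parts of a simple PDF file (illustrative sketch only)."""
    # 1. Header: the file begins with "%PDF-" followed by the version number.
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("missing %PDF- header")

    # 2. Body: every object starts with "<number> <generation> obj" at the
    #    beginning of a line. Record the byte offset of each one, which is
    #    the same information the cross-reference table stores.
    object_offsets = {
        int(m.group(1)): m.start()
        for m in re.finditer(rb"^(\d+) (\d+) obj\b", data, re.MULTILINE)
    }

    # 3. Cross-reference table: introduced by the keyword "xref".
    xref = re.search(rb"^xref\b", data, re.MULTILINE)

    # 4. Trailer: introduced by the keyword "trailer".
    trailer = re.search(rb"^trailer\b", data, re.MULTILINE)

    return {
        "version": header.group(1).decode("ascii"),
        "object_offsets": object_offsets,
        "xref_offset": xref.start() if xref else None,
        "trailer_offset": trailer.start() if trailer else None,
    }
```

Calling describe_layout on the bytes of a small file such as the one in Example 3-1 should report the version from the header, one offset per object in the body, and the positions of the xref and trailer keywords. Either position is None when the keyword is not found.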

For reference, we reproduce the Hello, World PDF from Chapter 2 as Example 3-1. The first line of each of the four sections has been annotated.

Example 3-1. A small PDF file
%PDF-1.1 Header starts here
%âãÏÓ 
1 0 obj Body starts here
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj 
2 0 obj 
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]
/Type /Page
>>
endobj 
3 0 obj 
<<
/Font 
<<
/F0 
<<
/BaseFont /Times-Italic
/Subtype /Type1
/Type /Font
>>
>>
>>
endobj 
4 0 obj 
<<
/Length 65 
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
  /F0 36. Tf
  (Hello, World!) Tj
ET 
endstream 
endobj 
5 0 obj 
<<
/Pages 1 0 R
/Type /Catalog
>>
endobj
xref  Cross-reference ...
