The Portable Document Format (PDF) is the world’s leading language for describing the printed page, and the first one equally suitable for paper and online use. In this chapter, we take a tour of its uses, features, and history. We look at some useful free software and resources, some of which we’ll use later in this book.
Today we take the high-fidelity exchange of documents for granted, knowing that a document sent here will appear the same there and vice versa, and that it may be displayed equally on screen and on paper. This was not always so.
We could pass documents between users, and from user to printer, as a series of bitmap pictures (e.g., TIFF or PNG), one for each page. However, this doesn’t allow for any structure to be retained, precludes scaling to different paper sizes or resolutions without loss of quality, involves huge file sizes, and so on.
A page description language like PDF is a way of describing the contents (text and graphics) of a printed or onscreen page using highly structured data, often with extra metadata describing various aspects of the document (such as printing information, textual annotations, or how it is to be viewed). This way, decisions about how the document is rasterized (converted to pixels by a printer or on screen) can be left until the end of the production process. A PDF file can contain text and associated font definitions, vector and bitmap graphics, navigation (such as hyperlinks and bookmarks), and interactive forms.
PDF is used wherever the exact presentation of the content is important (for example, for a print advertisement or book). It isn’t normally suitable when the content is to be laid out or reflowed at the last moment, such as in a variable-width web page; languages like HTML and CSS, which separate content from presentation, are more suitable in those circumstances.
Many page description languages were created when the printing of lines of text in fixed fonts began to be replaced by digital graphics printing. Examples include PostScript (Adobe), PCL (Hewlett Packard), and KPDL (Kyocera). The printer would process the language to generate a bitmap at the appropriate resolution. Simpler languages were used for vector plotters (for example, HPGL from Hewlett Packard).
These languages varied in complexity and functionality. PostScript files, for example, are full programs—the result of executing the program is the document’s visual representation. These languages often contain extra instructions to control aspects of the document other than the page content, for example which tray paper is drawn from or whether the output is to be duplexed.
PDF began as an internal project at Adobe to create a platform-neutral method for document interchange. PostScript was already popular in the print community, but wasn’t practical for onscreen use with the computers of the day, especially for random access (to render page 50 of a PostScript document, one must process pages 1–49 first). The idea was to use a subset of the PostScript graphics language together with ancillary data to create a structured language for standalone documents to be viewed on (or printed from) any computer.
PDF 1.0 was announced in 1993, with Acrobat Distiller (for creating and editing PDF files) and Acrobat Reader (for viewing only), both as paid-for programs. The US Tax Authorities started to ship tax forms as PDFs, purchasing a license to allow their users to download Acrobat Reader for free. Later on, Acrobat Reader was made available to everybody at no cost, leading to the widespread use of PDF for the exchange of documents online.
Over the next 10 years, after a slow start as prepress features were added, PDF overtook PostScript as the language of choice in the printing industry. Today, it is the only general page description language of note.
When a number of formats compete to be the industry standard, the best contender is not always the victor—luck can intervene. In this case, though, PDF had a number of singular advantages. We look at some of them here.
Unlike PostScript, any object (page, graphic, etc.) in a PDF document can be accessed at will, in constant time. This means it’s no harder to read page 150 than page 1. Linearization builds on this: it is the process of arranging the objects in the file so that all those needed for a given page are located in adjacent positions. This explains why you can quickly jump to any page in a PDF being viewed in Acrobat Reader in a web browser window. The viewer doesn’t need to load the whole file to begin with; it fetches from the server just the sections needed to display each new page.
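As a rough illustration of why random access is cheap: a PDF file ends with a `startxref` keyword giving the byte offset of its cross-reference data, which in turn records where each object lives. The sketch below reads only the tail of the data to find that offset. The function name `find_startxref` and the 1024-byte tail size are our own choices for illustration, and real-world files call for more forgiving parsing.

```python
def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    keyword = b"startxref"
    tail = data[-tail_size:]          # only the end of the file is examined
    idx = tail.rfind(keyword)
    if idx == -1:
        raise ValueError("no startxref keyword found near the end of the data")
    # The offset is the first whitespace-separated token after the keyword.
    tokens = tail[idx + len(keyword):].split()
    if not tokens:
        raise ValueError("startxref keyword is not followed by an offset")
    return int(tokens[0])
```

A reader would then seek directly to that offset, look up the object it wants, and seek again, without scanning the rest of the file.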
Stream creation is the ability inherent in the PDF format to allow files to be created in order, from beginning to end, even if the eventual file is larger than the memory available.
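To make this concrete, here is a minimal sketch of a writer that emits objects strictly in order, remembering only each object's byte offset, and writes the cross-reference table at the very end. The three objects and the function name `build_minimal_pdf` are illustrative assumptions; the result is a one-page blank document, not a general-purpose writer. A real streaming writer would write to a file handle rather than an in-memory buffer, but the bookkeeping is the same.

```python
def build_minimal_pdf() -> bytes:
    """Write a tiny one-page PDF front to back, tracking object offsets."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> "
        b"/MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))              # where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)
    count = len(objects) + 1                  # plus the special object 0
    out += b"xref\n0 %d\n" % count
    out += b"0000000000 65535 f \n"           # head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset   # each entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % count
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)
```

Only the list of offsets has to be kept in memory, which is what allows the file itself to be larger than the memory available.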
Incremental update means that, when editing a file, it’s possible to write the changes to the end of the file without modifying any existing part—this makes saving changed versions very fast, and can be used to provide an undo mechanism (since the previous version is still intact).
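Because each saved revision ends with its own `%%EOF` marker, the earlier versions of an incrementally updated file can be recovered by truncation. The following is a heuristic sketch only: the function name `revisions` is ours, and it assumes the marker bytes never occur by coincidence inside binary stream data. A robust tool would instead follow the chain of `/Prev` entries in the file trailers.

```python
EOF_MARKER = b"%%EOF"

def revisions(data: bytes) -> list[bytes]:
    """Split a file into its cumulative revisions, oldest first (heuristic)."""
    result = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        end = pos + len(EOF_MARKER)
        result.append(data[:end])     # everything up to this marker
        pos = data.find(EOF_MARKER, end)
    return result
```

The last element is the current document and the one before it is the state prior to the most recent save, which is the basis of the undo mechanism described above.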
Fonts used in a PDF are embedded along with the document. This means that the document should always be rendered correctly, regardless of which fonts are installed on a given computer. The program creating the PDF document will remove unnecessary data from the font (such as metrics and unused characters), so the file does not become unduly large. PDF supports all common font formats, such as TrueType and Type 1.
Most PDF files maintain the information to map the character shapes making up the text to Unicode character codes. This means that you can copy and paste text from a document, or search the text easily. More recent developments in PDF allow the logical order of the text in the document to be stored separately from the layout of the text on the page, preserving yet more structured information.
PDF was released as an open standard by the International Organization for Standardization (ISO) in 2008. The ISO 32000-1:2008 document is largely the same as the PDF file format document previously released by Adobe.
This independence lends legitimacy and oversight to the PDF standard, which should encourage its further adoption. However, with no real tools for detecting whether a file meets the standard (Adobe Reader will happily load malformed files, so many tools create them), genuine rigor is some time away.
There are several specialized variations on the PDF format—both standardized, and in development. These are subsets of the PDF format. Each file is a valid PDF document, but with restrictions on the facilities used or the content itself. Two of these, PDF/A and PDF/X, are now ISO standards.
The PDF/A Standard (ISO 19005-1:2005) defines a set of rules for documents intended for long-term archiving in libraries, national archives and bureaucracies. It also requires a “conforming reader” to act in certain ways, using the embedded fonts, using color management, and so forth. Briefly, the restrictions on PDF/A are:
All fonts to be embedded
Metadata is required
Device-independent color spaces only
No audio or video content
There are two levels of PDF/A compliance: PDF/A-1b (“level B compliance”) requires exact visual reproduction of the document. PDF/A-1a (“level A compliance”) requires that text can be mapped to Unicode, and that the order and structure of the text is documented, in addition to the requirement of exact visual reproduction.
The PDF/A Competence Center is an industry group representing PDF/A stakeholders. A second ISO version of PDF/A is in preparation.
The PDF/X Standard is a family of ISO standards for graphics exchange in the printing industry, the latest of which is PDF/X-5 (ISO 15930-8:2010). It defines a number of restrictions:
All fonts must be embedded
All image data must be embedded
Cannot contain sound, films or non-printable annotations
Limited compression algorithms
and a number of extra requirements:
The file is marked as PDF/X with the subversion (e.g., PDF/X-5)
Bleed, trim and/or art boxes are required, in addition to the normal page size. These boxes define the size of the media, the printable area, the final cut size, and so on.
A flag is set if the file has been trapped. Trapping is the process of creating small overlaps between graphical objects to mask registration problems in multiple color printing processes.
The file must contain an output intent, containing a color profile describing how it is to be printed.
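The page boxes are plain rectangles measured in PDF's default unit of 1/72 inch. As a worked example of the arithmetic, the sketch below derives a media box, bleed box, and trim box from a finished page size in millimetres. The 3 mm bleed is a common print convention that we assume here, not something the standard mandates, and the helper names are hypothetical.

```python
MM_TO_PT = 72 / 25.4   # points per millimetre (1 point = 1/72 inch)

def boxes_for_trim_size(width_mm: float, height_mm: float,
                        bleed_mm: float = 3.0) -> dict[str, list[float]]:
    """Return page boxes as [llx, lly, urx, ury] rectangles in points."""
    bleed = bleed_mm * MM_TO_PT
    width = width_mm * MM_TO_PT
    height = height_mm * MM_TO_PT
    # In this sketch the media is exactly large enough to hold the bleed.
    media = [0.0, 0.0, width + 2 * bleed, height + 2 * bleed]
    trim = [bleed, bleed, bleed + width, bleed + height]
    return {"MediaBox": media, "BleedBox": list(media), "TrimBox": trim}
```

For an A4 page (210 mm by 297 mm), the trim box comes out roughly 595.28 points wide, sitting about 8.5 points inside the media box on every side.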
PDF is fully backward compatible (you can load a PDF version 1.0 document into a program designed for PDF 1.7) and mostly forward compatible (programs written for PDF 1.0 can normally load PDF 1.7 files). Forward compatibility is ensured because readers ignore content they don’t understand—it’s only when new compression methods or object storage mechanisms are introduced that this may be broken. Since PDF 1.5 in 2003, such changes have been minimal. PDF versions and their features are summarized in Table 1-1.
Table 1-1. Functionality in PDF versions 1.0 to 1.7 Extension Level 8
|PDF version|Acrobat Reader version|Launched|Summary of new features|
|---|---|---|---|
|1.0|1.0|1993|Initial release.|
|1.1|2.0|1994|Device-independent color spaces, encryption (40-bit), article threads, named destinations, and hyperlinks.|
|1.2|3.0|1996|AcroForms (interactive forms), films and sounds, more compression methods, Unicode support.|
|1.3|4.0|2000|More color spaces, embedded (attached) files, digital signatures, annotations, masked images, gradient fills, logical document structure, prepress support.|
|1.4|5.0|2001|Transparency, 128-bit encryption, better form support, XML metadata streams, tagged PDF, JBIG2 compression.|
|1.5|6.0|2003|Object streams and cross-reference streams for more compact files, JPEG 2000 support, XFA forms, public-key encryption, custom encryption methods, optional content groups.|
|1.6|7.0|2004|OpenType fonts, 3D content, AES encryption, new color spaces.|
|1.7 (later ISO 32000-1:2008)|8.0|2006|XFA 2.4, new kinds of string, extensions to public-key architecture.|
|1.7 Extension Level 3|9.0|2008|256-bit AES encryption.|
|1.7 Extension Level 5|9.1|2009|XFA 3.0.|
|1.7 Extension Level 8|X|2011|Not yet known.|
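The version a file claims is declared in its first line, in a header of the form `%PDF-1.4`. A minimal sketch of reading it follows; the function name `header_version` and the choice to search the first 1024 bytes are our assumptions. Note that from PDF 1.4 onward a `/Version` entry in the document catalog can override the header, which this sketch does not consult.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")

def header_version(data: bytes) -> tuple[int, int]:
    """Return (major, minor) from the %PDF-x.y header near the file start."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        raise ValueError("no %PDF-x.y header found")
    return int(match.group(1)), int(match.group(2))
```

A tool could compare the result against the versions it supports and warn, rather than fail, when the file is newer.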
A typical PDF file contains many thousands of objects, multiple compression mechanisms, different font formats, and a mixture of vector and raster graphics together with a wide variety of metadata and ancillary content. We take a brief tour of these elements here, for context—they are covered more fully in later chapters.
A PDF file can contain text drawn from multiple fonts of all popular formats (Type 1, TrueType, OpenType, legacy bitmap fonts, etc.). Font files are embedded in the document, so the character shapes are always available, meaning the file should render the same on any computer. A variety of character encodings are supported, including Unicode.
Text can be filled with any color, pattern, or transparency. A piece of text may be used as a shape to clip other content, allowing complicated graphical effects whilst text remains selectable and editable.
Typically, enough information is encoded in a PDF document to allow text extraction, though the process is not always straightforward.
Graphical content in PDF is based on the model first used in Adobe’s PostScript language. It consists of paths built from straight lines and curves. Each path may be filled, “stroked” to draw a line, or both. Lines can have varying thicknesses, join styles and dash patterns.
Paths may be filled in any color, with a repeating pattern defined by other objects, or with a smooth gradient between two colors. All these options apply also to the lines of stroked paths.
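In the file, such graphics are written as a sequence of operands followed by short operator names. The sketch below generates the operators for a single rectangle: `rg` and `RG` set the fill and stroke colors, `w` the line width, `d` the dash pattern, `re` appends a rectangle to the path, and `f`, `S`, or `B` fill, stroke, or do both. The helper `rectangle_ops` is hypothetical and covers only this one shape with RGB colors.

```python
def rectangle_ops(x, y, w, h, fill_rgb=None, stroke_rgb=None,
                  line_width=1.0, dash=None) -> str:
    """Return content-stream operators that paint one rectangle."""
    ops = []
    if fill_rgb is not None:
        ops.append("%g %g %g rg" % tuple(fill_rgb))    # fill color
    if stroke_rgb is not None:
        ops.append("%g %g %g RG" % tuple(stroke_rgb))  # stroke color
        ops.append("%g w" % line_width)                # line thickness
        if dash:
            pattern = " ".join("%g" % n for n in dash)
            ops.append("[%s] 0 d" % pattern)           # dash pattern, phase 0
    ops.append("%g %g %g %g re" % (x, y, w, h))        # build the path
    if fill_rgb is not None and stroke_rgb is not None:
        ops.append("B")    # fill and stroke
    elif fill_rgb is not None:
        ops.append("f")    # fill only
    elif stroke_rgb is not None:
        ops.append("S")    # stroke only
    else:
        ops.append("n")    # end the path without painting it
    return "\n".join(ops)
```

Calling `rectangle_ops(10, 20, 100, 50, fill_rgb=(1, 0, 0), stroke_rgb=(0, 0, 0), line_width=2, dash=[3, 1])` yields a red rectangle with a dashed black outline two units thick.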
Paths can be rendered using a variety of plain or gradient transparencies, with several different blend modes defining how semitransparent objects interact. Objects may be grouped together for the purposes of transparency, so a single transparency can be applied to a whole group of objects at once.
Paths can be used to clip other objects, so that only sections of those objects overlapping with the clipping path are shown. These clipping regions may be nested within one another.
PDF has a mechanism which allows a graphic to be defined once and then used multiple times in different contexts. This can be used, for instance, for a recurring motif, even across more than one page.
PDF documents can include bitmap images between 1 and 16 bits per component, in several color spaces (for example, three-component RGB or four-component CMYK). Images can be compressed using a variety of lossless and lossy compression mechanisms.
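To see why compression matters, it helps to work out the uncompressed size of an image. Each row of samples is packed into whole bytes, so the size is the padded row length times the height. The function below is a sketch of that arithmetic; its name and the set of accepted bit depths (1, 2, 4, 8, 16) reflect our reading of the format rather than anything stated above.

```python
def raw_image_size(width: int, height: int, components: int,
                   bits_per_component: int) -> int:
    """Return the uncompressed size in bytes of an image's sample data."""
    if bits_per_component not in (1, 2, 4, 8, 16):
        raise ValueError("unsupported bits per component")
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8   # rows are padded to whole bytes
    return bytes_per_row * height
```

An A4 page scanned at 300 dpi is about 2480 by 3508 pixels: as 8-bit RGB that is 26,099,520 bytes before compression, while the same page as a 1-bit black and white image is 1,087,480 bytes.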
Images may be placed at any scale or rotation, used to create a fill pattern, and may have a mask which defines how they use transparency to blend with the background they are placed on.
PDF can use color spaces related to particular electronic or print devices (grayscale, RGB, CMYK) and ones related to human color perception. In addition, there are color spaces for the printing industry such as spot colors. Mechanisms exist for simpler PDF programs (like onscreen viewers) to fall back to basic color spaces if they do not support the more advanced ones.
PDF documents have a set of standard metadata, such as title, author, keywords and so on. These are defined outside the graphical content and have no effect on the document when viewed. The creator (the program which created the content) and producer (the program that wrote the PDF file) are also recorded. Each document also has a set of unique identifiers, allowing them to be tracked through a workflow.
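This standard metadata lives in a dictionary of key and string pairs, with keys such as `/Title`, `/Author`, `/Keywords`, `/Creator`, and `/Producer`. A minimal sketch of serializing one follows. Both helper names are hypothetical, and the code assumes plain ASCII values; other text needs one of the encodings the format defines for strings.

```python
def pdf_string(text: str) -> str:
    """Write text as a PDF literal string, escaping special characters."""
    escaped = (text.replace("\\", "\\\\")   # backslash first, then parentheses
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return "(" + escaped + ")"

def info_dictionary(**entries: str) -> str:
    """Serialize metadata entries as a PDF dictionary."""
    parts = ["/%s %s" % (key, pdf_string(value))
             for key, value in entries.items()]
    return "<< " + " ".join(parts) + " >>"
```

For instance, `info_dictionary(Title="PDF (intro)", Author="A. Writer")` produces `<< /Title (PDF \(intro\)) /Author (A. Writer) >>`.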
Since PDF 1.4, the metadata can be stored in an XML (eXtensible Markup Language) document embedded in the PDF using Adobe’s Extensible Metadata Platform (XMP). This defines a way to store metadata for objects in the PDF which can be extended by third parties to hold information relevant to their particular workflows or products.
PDF documents have two methods of navigation, when viewed on screen:
The document outline, commonly known as the document’s bookmarks, is a structured list of destinations within the document, shown alongside it. Clicking on one moves the view to that page or position.
Hyperlinks within the text or graphics of a document allow the user to click to move elsewhere within the document, or to open an external URL.
Optional content groups in PDF allow parts of the content of a page to be grouped together and shown—or not shown—based on some other factor (user choice, whether the document is on screen or printed, the zoom factor). Relationships between groups can be defined, so that they depend upon one another. One use for this is to emulate the “layers” found in graphics packages. For example, Adobe Illustrator layers are preserved when a document it produces is read with a PDF viewer.
PDF documents can include various kinds of multimedia elements. A lot of this breaks the portability inherent in PDF, and is often not well supported outside of Adobe products.
Sounds and movies can be embedded.
Slide shows can be defined, to move automatically between pages with transition effects.
A more general system for including arbitrary media types was introduced.
3D artwork can be embedded.
There are two incompatible forms architectures in PDF: AcroForms, which is an open standard, and the Adobe XML Forms Architecture (XFA), which is documented but requires commercial software from Adobe.
Logical structure facilities allow information about the structural content (chapters, sections, figures, tables, and footnotes) to be included alongside the graphical content. The particular elements are customizable by third parties.
A tagged PDF is one which has logical structure based on a set of Adobe-defined elements. Files following these conventions can be reflowed by a reader to display the same text in a different page size or text size, for example in an ebook reader.
PDF documents can be encrypted for security, using RC4 or AES encryption methods. There are two passwords: the owner password and the user password. The owner password unlocks the file for all changes; the user password allows only the range of operations selected by the owner when the file was originally encrypted (for example, allowing or disallowing printing or text extraction). Frequently the user password is blank, so the file appears to open as normal, but functionality is restricted.
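The operations permitted to the user are recorded as bit flags in a single integer, the `/P` entry of the file's encryption dictionary. The sketch below decodes four of the flags, using the bit numbering of the PDF specification (bits counted from 1, lowest first). The dictionary keys and the function name are our own labels, and later revisions of the format define further bits that are ignored here.

```python
# Bit positions as numbered in the PDF specification (1 = lowest bit).
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_or_extract": 5,
    "annotate_or_fill_forms": 6,
}

def decode_permissions(p: int) -> dict[str, bool]:
    """Decode a /P value (a signed 32-bit integer) into named permissions."""
    value = p & 0xFFFFFFFF                   # view as unsigned 32-bit
    return {name: bool(value & (1 << (bit - 1)))
            for name, bit in PERMISSION_BITS.items()}
```

A value of -44, for example, decodes to printing and copying allowed but modification and annotation disallowed. Enforcement is left to the viewing program.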
Starting with PDF 1.3, digital signatures can be used to authenticate the identity of a user or the contents of the document.
Images and other data streams in PDF can be compressed using a variety of lossless and lossy methods defined by third parties. By compressing only these streams (rather than the whole file), the structure of the PDF objects is always available without decompressing the whole file, and compressed sections can be processed only when needed. There are several groups of compression methods:
Lossless compression for bi-level (e.g., black and white) images. PDF supports the standard fax encoding methods for bi-level images and, from PDF 1.4, the JBIG2 standard, which provides better compression for the same class of images.
Lossy image filters such as JPEG and, from PDF 1.5, JPEG2000.
Lossless compression mechanisms suitable for image data and general data compression, such as Flate (the deflate algorithm used by zip and zlib), Lempel-Ziv-Welch (LZW), and run-length encoding.
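Two of the lossless methods are simple enough to sketch. Flate data in a PDF is in zlib format, so Python's standard `zlib` module can round-trip it directly. Run-length data is a series of runs, each starting with a length byte: 0 to 127 means copy the next length + 1 bytes literally, 129 to 255 means repeat the next byte 257 − length times, and 128 marks the end of the data. The function names below are our own, and the decoder assumes well-formed input.

```python
import zlib

def flate_encode(data: bytes) -> bytes:
    """Compress stream data with Flate (zlib format)."""
    return zlib.compress(data)

def flate_decode(data: bytes) -> bytes:
    """Decompress Flate stream data."""
    return zlib.decompress(data)

def run_length_decode(data: bytes) -> bytes:
    """Decode run-length encoded stream data."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:                    # end-of-data marker
            break
        if length < 128:                     # literal run of length + 1 bytes
            out += data[i + 1:i + 2 + length]
            i += 2 + length
        else:                                # next byte repeated 257 - length times
            out += bytes([data[i + 1]]) * (257 - length)
            i += 2
    return bytes(out)
```

Repetitive data such as a page's drawing operators typically shrinks considerably under Flate, which is one reason it is widely used for stream data.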
PDF is used in a wide variety of industries and professions. We describe some here, explaining why PDF is suitable for each.
PDF has support for the color spaces, page dimension information (such as media, crop, art and bleed boxes), trapping support, and resolution-independence required for commercial printing. Together with other technologies, PDF is the key part of the publishing-for-print workflow. The extensibility of PDF metadata allows various schemes for including extra data along with the document, and for keeping it with the document throughout the publishing process—parts of the workflow which don’t understand a particular piece of metadata will at least preserve it.
This book was created using the DocBook system, which takes a structured document in XML format, typesets it, and produces a PDF complete with hyperlinks and bookmarks, together with a more traditional PDF suitable for printing.
PDF is one of the competing eBook formats. To support display on a wide range of screens, PDF documents may be tagged with reflow information, allowing lines of text to be displayed at differing widths on each device. This is at odds with the other uses of PDF, where fixed text layout is a requirement.
PDF forms are especially useful when existing paper-based systems are being transitioned to electronic ones, or must exist alongside them. A PDF form (filled in online then printed out) looks the same as one filled in manually on paper, and may be processed by existing human and computer systems in the same way.
Through PDF/A, PDF is the ideal format for long-term archiving, combining accurate representations of scanned and electronic content, together with Unicode language support, and compression mechanisms for all sorts of data including the important CCITT Fax and JBIG2 methods for monochrome images. Being an ISO standard (and one which is near-ubiquitous) guarantees that these documents can be read long into the future.
PDF can hold the results of Optical Character Recognition (OCR): searchable text is created from the scanned original, and the exact visual representation is retained alongside the recognized text.
PDF is not, at first sight, suitable for use as an editable vector graphics format. For example, a circle won’t remain editable as a circle, since it will have been converted to a number of curves (there is no circle element in PDF).
However, if appropriate use is made of its extensibility to store auxiliary data, it makes a good solution. Adobe Illustrator, for example, now uses an extended form of PDF as its file format. The file can be viewed in any PDF viewer but Illustrator can make use of the extended data when it is loaded back into the program.
In this book, we use various pieces of software to help us with examples. Luckily, everything you need is freely available. You’ll need a PDF viewer:
Acrobat Reader is Adobe’s own PDF viewer. It supports all versions and features of PDF and comes with a browser plug-in on most platforms. It’s available for Microsoft Windows, Mac OS X, Linux, Solaris, and Android.
Preview is the pre-installed PDF viewer and browser plug-in for PDF documents on Mac OS X. It’s highly capable, and very fast, but doesn’t support everything that Acrobat Reader does. Many people stick with Preview as the default application for PDF files, but install Acrobat Reader as well.
Xpdf is an open source PDF viewer for Unix. It supports a reasonable subset of PDF.
gv is a PostScript and PDF viewer frontend for Ghostscript (see below). It can render the textual and graphical content of almost all documents. However, it lacks most of the interactive features of other PDF viewers.
There are two key command-line tools:
pdftk is a multiplatform command-line tool for processing PDF files in various ways. It can be downloaded in pre-built form for Microsoft Windows, Mac OS X, and Linux, as well as in source code form.
Ghostscript is a set of tools including an interpreter for PostScript and PDF. It can be used to render PDF files, and to process them in various ways from the command line. It is available in binary form for Microsoft Windows, and in source code form for all platforms.
A full discussion of Adobe and open-source PDF software is in Chapter 10.