July 2017
Several APIs support extracting text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source library that allows Java programmers to work with PDF documents. In this section, we will illustrate how to extract simple text from a PDF document. Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.
This is a simple PDF file. It consists of several bullets:
This is the end of the document.
A try block is used to catch IOExceptions. The PDDocument class represents the PDF document being processed. Its load method loads the PDF file specified by the File object:
try { PDDocument document = ...
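As a minimal sketch of the steps just described, the following class loads a PDF and returns its text. It assumes the PDFBox 2.0.x API referenced above is on the classpath, where PDDocument.load(File) and PDFTextStripper.getText(PDDocument) are available. Later 3.x releases load documents through a separate Loader class, so the load call would need adjusting there. The class name PdfTextExtractor, the method name extractText, and the default file name example.pdf are hypothetical and used only for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; not part of PDFBox or of the original example.
public class PdfTextExtractor {

    // Loads the PDF file and returns all of its text as a single string.
    // try-with-resources closes the document even if extraction fails.
    public static String extractText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) {
        // "example.pdf" is an assumed default; pass a real path as the first argument.
        File pdfFile = new File(args.length > 0 ? args[0] : "example.pdf");
        try {
            System.out.println(extractText(pdfFile));
        } catch (IOException ex) {
            // A missing, unreadable, or malformed file ends up here.
            System.err.println("Could not read " + pdfFile + ": " + ex.getMessage());
        }
    }
}
```

Using try-with-resources rather than a plain try block is a design choice of this sketch: PDDocument implements Closeable, so the document is released automatically without an explicit call to close.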