Java: Data Science Made Easy

by Richard M. Reese, Jennifer L. Reese, Alexey Grigorev

July 2017

Beginner to intermediate

715 pages

17h 3m

English

Packt Publishing

Content preview from Java: Data Science Made Easy

Handling PDF files

Several APIs support extracting text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source API that allows Java programmers to work with PDF documents. In this section we will illustrate how to extract simple text from a PDF document. The Javadocs for the PDFBox API are available at https://pdfbox.apache.org/docs/2.0.1/javadocs/.

This is a simple PDF file. It consists of several bullets:

Line 1
Line 2
Line 3

This is the end of the document.

A try block is used to catch any IOException thrown during processing. The PDDocument class represents the PDF document being processed. Its load method loads the PDF file specified by a File object:

try {  PDDocument document = ...
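
The load-inside-try pattern described above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the book's own listing: it assumes PDFBox 2.0.x is on the classpath (the release line of the Javadocs linked above), and it uses PDFTextStripper for the extraction step, which this preview does not show. The class name PdfTextExtractor, the helper extractText, and the file name example.pdf are hypothetical names chosen for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class name, used only for this illustration.
public class PdfTextExtractor {

    // Loads the PDF referenced by pdfFile and returns its plain text.
    // Assumes PDFBox 2.0.x, where PDDocument.load(File) is available.
    static String extractText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = PDDocument.load(pdfFile)) {
            // Assumption: PDFTextStripper performs the text extraction.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) {
        // "example.pdf" is a placeholder; pass a real path as the first argument.
        File pdfFile = new File(args.length > 0 ? args[0] : "example.pdf");
        try {
            System.out.println(extractText(pdfFile));
        } catch (IOException ex) {
            // A missing or unreadable file ends up here.
            System.err.println("Could not read " + pdfFile + ": " + ex.getMessage());
        }
    }
}
```

Keeping the load call inside the try block means a missing or unreadable file is reported through the same IOException handler as an extraction failure. The try-with-resources form also releases the document's resources without an explicit close call.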

ISBN: 9781788475655