Skip to Content
Data Wrangling with Python
book

Data Wrangling with Python

by Jacqueline Kazil, Katharine Jarmul
February 2016
Beginner to intermediate
508 pages
12h 27m
English
O'Reilly Media, Inc.
Content preview from Data Wrangling with Python

Chapter 5. PDFs and Problem Solving in Python

Publishing data only in PDFs is criminal, but sometimes you don’t have other options. In this chapter, you are going to learn how to parse PDFs, and in doing so you will learn how to troubleshoot your code.

We will also cover how to write a script, starting with some basic concepts like imports, and introduce some more complexity. Throughout this chapter, you will learn a variety of ways to think about and tackle problems in your code.

Avoid Using PDFs!

The data used in this section is the same data as in the previous chapter, but in PDF form. Normally, one does not seek data in difficult-to-parse formats, but we did for this book because the data you need to work with may not always be in the ideal format. You can find the PDF we use in this chapter in the book’s GitHub repository.

There are a few things you need to consider before you start parsing PDF data:

  • Have you tried to find the data in another form? If you can’t find it online, try using a phone or email.

  • Have you tried to copy and paste the data from the document? Sometimes, you can easily select, copy, and paste data from a PDF into a spreadsheet. This doesn’t always work, though, and it is not scalable (meaning you can’t do it for many files or pages quickly).

If you can’t avoid dealing with PDFs, you’ll need to learn how to parse your data with Python. Let’s get started.

Programmatic Approaches to PDF Parsing

PDFs are more difficult to work with than Excel files ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Data Wrangling with Python

Data Wrangling with Python

Dr. Tirthajyoti Sarkar, Shubhadeep Roychowdhury
Python for Data Analytics

Python for Data Analytics

O'Reilly Media, Inc.

Publisher Resources

ISBN: 9781491948804Errata PageSupplemental Content