Chapter 4. Dredging up History

Parse PDF files to locate data that's otherwise nearly inaccessible. The web is full of PDF files, many of which contain valuable intelligence. The problem is extracting this intelligence in a form where we can analyze it. Some PDF text can be extracted with sophisticated parsers. At other times, we have to resort to Optical Character Recognition (OCR) because the PDF is actually an image created with a scanner. How can we leverage information that's buried in PDFs?

In some cases, we can use a save as text option to try and expose the PDF content. We then replace a PDF parsing problem with a plain-text parsing problem. While PDFs can seem dauntingly complex, the presence of exact page coordinates can actually simplify ...

Get Python for Secret Agents - Volume II now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.