Extracting PDF content

In Chapter 1, New Missions – New Tools, we installed PDF Miner 3K to parse PDF files. It's time to see how this tool works. Here's the link to the documentation for this package: http://www.unixuser.org/~euske/python/pdfminer/index.html. This link is not obvious from the PyPI page, or from the BitBucket site that contains the software. An agent who scans the docs/index.html will see this reference.

In order to see how we use this package, visit http://www.unixuser.org/~euske/python/pdfminer/programming.html. This has an important diagram that shows how the various classes interact to represent the complex internal details of a PDF document. For some helpful insight, visit http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/ ...

Get Python for Secret Agents - Volume II now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Python for Secret Agents - Volume II by Steven Lott

Extracting PDF content

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly