Extracting PDF content
In Chapter 1, New Missions – New Tools, we installed PDF Miner 3K to parse PDF files. It's time to see how this tool works. The documentation for this package is at http://www.unixuser.org/~euske/python/pdfminer/index.html. This link is not obvious from the PyPI page or from the BitBucket site that hosts the software. An agent who scans the docs/index.html file will see this reference.
To see how we use this package, visit http://www.unixuser.org/~euske/python/pdfminer/programming.html. That page has an important diagram showing how the various classes interact to represent the complex internal details of a PDF document. For some helpful insight, visit http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/ ...
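As a rough illustration of how those classes fit together, here is a minimal sketch that chains a parser, a document, a resource manager, an aggregating device, and a page interpreter. It assumes the PDF Miner 3K layout described in the programming guide, where PDFDocument lives in pdfminer.pdfparser and pages come from doc.get_pages(); other forks of PDF Miner arrange these names differently. The functions text_from_layout and extract_text are hypothetical helpers of our own, not part of the package.

```python
def text_from_layout(layout):
    """Return the text of every layout object that exposes get_text().

    Hypothetical helper: shapes, lines, and images have no get_text()
    method, so they are skipped.
    """
    return [obj.get_text() for obj in layout if hasattr(obj, "get_text")]


def extract_text(path, password=""):
    """Yield a list of text blocks for each page of the PDF at `path`.

    Assumes the PDF Miner 3K API; the imports are local so this module
    still loads on a machine where the package is not installed.
    """
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams

    with open(path, "rb") as source:
        # The parser and the document refer to each other.
        parser = PDFParser(source)
        document = PDFDocument()
        parser.set_document(document)
        document.set_parser(parser)
        document.initialize(password)

        # The interpreter renders each page onto the aggregating device,
        # which collects the result as a tree of layout objects.
        resources = PDFResourceManager()
        device = PDFPageAggregator(resources, laparams=LAParams())
        interpreter = PDFPageInterpreter(resources, device)

        for page in document.get_pages():
            interpreter.process_page(page)
            yield text_from_layout(device.get_result())
```

A caller might loop with for blocks in extract_text("some_file.pdf"), where the file name is only a placeholder. Keeping the layout filtering in its own small function means we can exercise it without a real PDF.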