Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. PDFMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly.
Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command:
pip install pdfminer
This will install the entire PDFMiner package and all its associated command-line tools.
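Before moving on, it can be worth confirming that the installation worked. The following is a minimal sketch, not part of the package itself: the helper name check_install is hypothetical, and it assumes a Python 3 interpreter and that the command-line tool is on your PATH under either the name pdf2txt.py or pdf2txt (the exact script name can vary by version and platform).

```python
import importlib.util
import shutil


def check_install(module_name="pdfminer", script_names=("pdf2txt.py", "pdf2txt")):
    """Return (module_found, script_path) for a package and its CLI tool.

    check_install is a hypothetical helper for illustration only.
    script_names lists the command names to look for on the PATH;
    the defaults are an assumption about how the tool was installed.
    """
    # find_spec returns None when a top-level module cannot be imported
    module_found = importlib.util.find_spec(module_name) is not None

    # shutil.which returns the full path of the first matching executable,
    # or None when nothing with that name is on the PATH
    script_path = None
    for name in script_names:
        script_path = shutil.which(name)
        if script_path:
            break

    return module_found, script_path


if __name__ == "__main__":
    found, path = check_install()
    print("pdfminer importable:", found)
    print("pdf2txt location:", path or "not found on PATH")
```

If the module is reported as importable but the script is not found, the package is probably installed but its scripts directory is not on your PATH.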