In the following steps, you will featurize a collection of PDF files using the PDFiD script:
- Download the tool and place all accompanying code in the same directory as featurizing PDF Files.ipynb.
- Import IPython's io module to capture the output of an external script:
from IPython.utils import io
- Define a function to featurize a PDF:
def PDF_to_FV(file_path):
    """Featurize a PDF file using pdfid."""
- Run pdfid against a file and capture the output of the operation:
    with io.capture_output() as captured:
        %run -i pdfid $file_path
    out = captured.stdout
- Next, parse the output into a numerical vector:
    out1 = out.split("\n")[2:-2]
    return [int(x.split()[-1]) for x in out1]
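To see what the parsing step does in isolation, here is a minimal sketch that applies the same slicing and splitting to a hard-coded string. The sample text, the helper name `parse_counts`, and the counts are hypothetical and only illustrate the assumed layout: two header lines, one "name count" row per feature, and a trailing blank line. Real captured output may differ.

```python
# Hypothetical sample shaped like a captured report: two header lines,
# one "name count" row per feature, then a trailing blank line.
# The names and numbers are illustrative, not taken from a real file.
SAMPLE_OUTPUT = (
    "PDFiD sample.pdf\n"
    " PDF Header: %PDF-1.3\n"
    " obj                    9\n"
    " endobj                 9\n"
    " stream                 2\n"
    " /Page                  1\n"
    " /JS                    0\n"
    "\n"
)


def parse_counts(text):
    """Turn captured report text into a list of integer counts."""
    # Skip the two header lines and the two empty trailing entries,
    # mirroring the [2:-2] slice used in the step above.
    rows = text.split("\n")[2:-2]
    # The count is the last whitespace-separated token on each row.
    return [int(row.split()[-1]) for row in rows]


vector = parse_counts(SAMPLE_OUTPUT)
print(vector)  # prints [9, 9, 2, 1, 0]
```

The `[2:-2]` slice depends on the exact number of header and trailing lines, so if the captured text has a different number of blank lines at the end, the slice bounds would need adjusting.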
- Import listdir to enumerate the files ...