How to do it...

In the following steps, you will featurize a collection of PDF files using the PDFiD script:

  1. Download the PDFiD tool and place all accompanying code in the same directory as featurizing PDF Files.ipynb.
  2. Import IPython's io module to capture the output of an external script:
from IPython.utils import io
  3. Define a function to featurize a PDF:
def PDF_to_FV(file_path):
    """Featurize a PDF file using pdfid."""
  4. Run pdfid against a file and capture the output of the operation:
    with io.capture_output() as captured:
        %run -i pdfid $file_path
    out = captured.stdout
  5. Next, parse the output so that it is a numerical vector:
    out1 = out.split("\n")[2:-2]
    return [int(x.split()[-1]) for x in out1]
  6. Import listdir to enumerate the files ...
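
The capture-and-parse steps above rely on the %run magic, which only works inside IPython. As a rough sketch of the same idea in plain Python, the block below runs the script as a child process and parses its report. The names parse_pdfid_output, pdf_to_fv_subprocess and the pdfid_script parameter are hypothetical, the default path pdfid.py is an assumption about where the tool was saved, and the sample report is hand-written to illustrate the layout rather than taken from a real file. The slice assumes, as the recipe's code does, that the report starts with two header lines and ends with a blank line.

```python
import subprocess
import sys


def parse_pdfid_output(out):
    """Turn a PDFiD text report into a list of integer counts.

    Mirrors the recipe's slicing: skip the two header lines and the two
    trailing empty strings, then keep the last token of each line.
    """
    rows = out.split("\n")[2:-2]
    return [int(row.split()[-1]) for row in rows]


def pdf_to_fv_subprocess(file_path, pdfid_script="pdfid.py"):
    """Run PDFiD as a child process and featurize its report.

    Assumption: pdfid_script points at a local copy of pdfid.py.
    """
    result = subprocess.run(
        [sys.executable, pdfid_script, file_path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if the script exits with an error
    )
    return parse_pdfid_output(result.stdout)


# Hand-written sample in the shape of a PDFiD report (illustrative values only).
sample_report = (
    "PDFiD 0.2.5 sample.pdf\n"
    " PDF Header: %PDF-1.4\n"
    " obj                   10\n"
    " endobj                10\n"
    " /JS                    0\n"
    " /JavaScript            1\n"
    "\n"
)
sample_vector = parse_pdfid_output(sample_report)
print(sample_vector)  # [10, 10, 0, 1]
```

Keeping the parser separate from the process call makes it possible to check the slicing on a saved report before running the tool over a whole directory.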
