How to do it...

In the following steps, you will featurize a collection of PDF files using the PDFiD script:

  1. Download the tool and place all accompanying code in the same directory as featurizing PDF Files.ipynb.
  2. Import IPython's io module so as to capture the output of an external script:

     from IPython.utils import io

  3. Define a function to featurize a PDF:

     def PDF_to_FV(file_path):
         """Featurize a PDF file using pdfid."""

  4. Run pdfid against a file and capture the output of the operation:

         with io.capture_output() as captured:
             %run -i pdfid $file_path
         out = captured.stdout

  5. Next, parse the output so that it is a numerical vector:

         out1 = out.split("\n")[2:-2]
         return [int(x.split()[-1]) for x in out1]

  6. Import listdir to enumerate the files ...
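
The parsing step above can be tried outside a notebook with a minimal sketch like the following. It assumes the captured text has two leading lines to skip, one "name count" row per feature, and two trailing entries to drop, which is what the `[2:-2]` slice in the recipe implies. The helper names `parse_counts` and `run_pdfid`, the default script path `pdfid.py`, and the sample text are illustrative assumptions, not part of the recipe; `run_pdfid` is offered only as a possible stand-in for the `%run` magic, which is available only inside IPython.

```python
import subprocess
import sys


def parse_counts(out):
    """Turn PDFiD-style text into a list of integer counts.

    Mirrors the slicing used in the recipe: skip the first two lines,
    drop the last two entries, and keep the last token of each row.
    """
    rows = out.split("\n")[2:-2]
    return [int(row.split()[-1]) for row in rows]


def run_pdfid(file_path, script="pdfid.py"):
    """Run the script in a subprocess and return its standard output.

    `script` is an assumed location; point it at wherever the tool lives.
    """
    result = subprocess.run(
        [sys.executable, script, file_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up sample shaped like the text the recipe parses; not real tool output.
sample = (
    "header line one\n"
    "header line two\n"
    " obj 7\n"
    " endobj 7\n"
    " /JS 1\n"
    " /JavaScript 0\n"
    "\n"
)
print(parse_counts(sample))  # prints [7, 7, 1, 0]
```

Keeping the parsing in its own function makes it possible to check the slicing against saved output before running the tool over a whole directory of files.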
