November 2019
In this section, we will see how to featurize PDF files so that they can be used for machine learning. The tool we will use is the PDFiD Python script written by Didier Stevens (https://blog.didierstevens.com/). Stevens selected a list of 20 features that are commonly found in malicious files, such as whether the PDF file contains JavaScript or launches an automatic action. Finding these features in a file is suspicious, so their presence can indicate malicious behavior.
In essence, the tool scans a PDF file and counts how many times each of the ~20 features occurs. A run of the tool appears as follows:
PDFiD 0.2.5 PythonBrochure.pdf
 PDF Header: %PDF-1.6
 obj                 1096
 endobj ...
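To make the counting idea concrete, here is a minimal Python sketch of keyword counting over the raw bytes of a PDF. It is not PDFiD itself and is not a substitute for it: the keyword list is only an illustrative subset, the function names (count_keywords, featurize_pdf) are hypothetical, and the sketch assumes plain, unobfuscated keywords, so it will miss names written with hex escapes.

```python
import re

# Illustrative subset of keywords (an assumption for this sketch);
# the full list used by PDFiD is not reproduced here.
KEYWORDS = [
    b"obj", b"endobj", b"stream", b"endstream",
    b"/JS", b"/JavaScript", b"/AA", b"/OpenAction", b"/Launch",
]


def _pattern(keyword):
    """Build a regex that matches the keyword as a whole token."""
    body = re.escape(keyword)
    # Bare words such as "obj" must not match inside "endobj".
    # PDF names (starting with "/") may directly follow another token,
    # so no look-behind is applied to them.
    prefix = b"" if keyword.startswith(b"/") else rb"(?<![A-Za-z0-9])"
    # A keyword must not be the start of a longer token.
    return re.compile(prefix + body + rb"(?![A-Za-z0-9])")


def count_keywords(data, keywords=KEYWORDS):
    """Return a dict mapping each keyword to its number of occurrences."""
    return {kw.decode("ascii"): len(_pattern(kw).findall(data)) for kw in keywords}


def featurize_pdf(path):
    """Read a PDF file as bytes and count the keywords in it."""
    with open(path, "rb") as f:
        return count_keywords(f.read())


# A tiny hand-written fragment (hypothetical) to try the counter on.
SAMPLE = (
    b"%PDF-1.6\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
sample_counts = count_keywords(SAMPLE)
```

Listing the counts in a fixed keyword order, for example [sample_counts[k.decode("ascii")] for k in KEYWORDS], gives a numeric feature vector of the kind a classifier can consume.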