November 2019
Unlike the previous recipe, in which we analyzed a single file's N-grams, this recipe looks at a large collection of files to determine which N-grams are the most informative features. We start by specifying the folders containing our samples, setting our value of N, and importing some modules to enumerate files (step 1). We then count all N-grams across all files in our dataset (step 2), which lets us find the globally most frequent N-grams. Of these, we keep only the K1=1000 most frequent ones (step 3). Next, we introduce a helper method, featurizeSample, which takes a sample and outputs the number of times each of the K1 most common N-grams appears in its byte sequence (step 4). We then iterate through our directories ...
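Steps 2 through 4 can be sketched roughly as follows. This is a minimal illustration rather than the recipe's own code: the names byte_ngrams, count_all_ngrams, load_directory, and featurize_sample (a stand-in for featurizeSample) are hypothetical, the value N = 2 is assumed, and the samples are short in-memory byte strings instead of files read from the sample folders.

```python
import os
from collections import Counter

N = 2      # N-gram length (assumed value, for illustration only)
K1 = 1000  # number of most frequent N-grams to keep, as in the text


def byte_ngrams(data, n):
    """Yield every run of n consecutive bytes in data as a tuple of ints."""
    return (tuple(data[i:i + n]) for i in range(len(data) - n + 1))


def count_all_ngrams(samples, n):
    """Step 2: accumulate N-gram counts over every sample in the dataset."""
    total = Counter()
    for data in samples:
        total.update(byte_ngrams(data, n))
    return total


def load_directory(directory):
    """Hypothetical helper: read every regular file in a folder as bytes."""
    samples = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                samples.append(handle.read())
    return samples


def featurize_sample(data, top_ngrams, n):
    """Step 4: count how often each selected N-gram occurs in one sample."""
    counts = Counter(byte_ngrams(data, n))
    return [counts[gram] for gram in top_ngrams]


# Toy stand-ins for file contents; real use would call load_directory(...).
samples = [b"ABABAB", b"ABCD"]

# Step 3: keep the most frequent N-grams (capped at 2 here instead of K1,
# because the toy data contains only a handful of distinct N-grams).
global_counts = count_all_ngrams(samples, N)
top_ngrams = [gram for gram, _ in global_counts.most_common(2)]

features = featurize_sample(b"ABAB", top_ngrams, N)
```

Each sample is reduced to a fixed-length vector whose length equals the number of selected N-grams, so every file in the dataset yields a row of the same width regardless of its size.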