How it works…

We start by preparing a dataset of raw emails (Step 1), which the reader can inspect directly. In Step 2, we specify the paths of the spam and ham email directories and assign a label to each directory. In Step 3, we read all of the emails into an array and create a matching array of labels. Next, we split the dataset into training and test sets (Step 4) and fit an NLP pipeline on the training set in Step 5. Finally, in Step 6, we test the pipeline on the held-out emails and see that its accuracy is high. Since the dataset is relatively balanced, accuracy alone is a reasonable measure of success, and there is no need for metrics designed for imbalanced classes.
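
As a rough illustration of Steps 2 through 6, the following minimal sketch uses only the Python standard library. It is not the recipe's own code: a small hand-written naive Bayes classifier over word counts stands in for the NLP pipeline, and the helper names (load_emails, split_train_test, NaiveBayesTextClassifier, accuracy), the whitespace tokenizer, and the 25% test fraction are all assumptions made for this example.

```python
import math
import os
import random
from collections import Counter


def load_emails(labeled_dirs):
    """Read every file in each directory; return parallel (texts, labels) lists."""
    texts, labels = [], []
    for directory, label in labeled_dirs.items():
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # Raw emails often contain invalid bytes, so decode leniently.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                texts.append(handle.read())
            labels.append(label)
    return texts, labels


def split_train_test(texts, labels, test_fraction=0.25, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))

    def pick(chosen):
        return [texts[i] for i in chosen], [labels[i] for i in chosen]

    return pick(indices[n_test:]), pick(indices[:n_test])


def tokenize(text):
    """Assumed tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = set().union(*self.word_counts.values())
        return self

    def predict(self, texts):
        return [self._predict_one(text) for text in texts]

    def _predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)
```

To use it, pass a mapping from each directory to its label, for example load_emails({spam_dir: 0, ham_dir: 1}), where spam_dir and ham_dir are placeholder paths and the numeric labels are an assumption. Then split the result, fit the classifier on the training portion, and compare its predictions on the test portion with accuracy. Because the split is a plain shuffle and is not stratified, this sketch relies on the same assumption as the text: the classes are roughly balanced.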
