How it works…

We start by preparing a dataset of raw emails (Step 1), which you can inspect directly to get a feel for the data. In Step 2, we specify the paths of the spam and ham email directories and assign a label to each directory. In Step 3, we read all of the emails into an array and build a parallel array of labels. Next, we split the dataset into training and testing sets (Step 4) and fit an NLP pipeline on the training set (Step 5). Finally, in Step 6, we evaluate the pipeline on the test set and see that its accuracy is high. Because the dataset is relatively balanced, accuracy is a meaningful measure here; on a heavily imbalanced dataset, a classifier could score well simply by predicting the majority class, and metrics such as precision and recall would be needed instead.
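The flow of Steps 2 through 6 can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the recipe's own code: the helper names (`load_corpus`, `train_test_split`, `TinyNaiveBayes`, `accuracy`) are hypothetical, and the small naive Bayes classifier is a stand-in for the recipe's NLP pipeline, chosen only so that the sketch runs without third-party dependencies.

```python
import math
import random
import re
from collections import Counter
from pathlib import Path


def load_corpus(labeled_dirs):
    """Steps 2-3: read every file in each directory into parallel lists."""
    texts, labels = [], []
    for directory, label in labeled_dirs.items():
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                # Raw emails can contain arbitrary bytes; latin-1 never fails to decode.
                texts.append(path.read_text(encoding="latin-1"))
                labels.append(label)
    return texts, labels


def train_test_split(texts, labels, test_fraction=0.25, seed=0):
    """Step 4: shuffle indices reproducibly, then cut into train and test parts."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([texts[i] for i in train], [labels[i] for i in train],
            [texts[i] for i in test], [labels[i] for i in test])


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyNaiveBayes:
    """Step 5 stand-in: multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def _predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label, counts in self.word_counts.items():
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, texts):
        return [self._predict_one(text) for text in texts]


def accuracy(y_true, y_pred):
    """Step 6: fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

To use the sketch, pass a mapping from directory to label, for example `load_corpus({"spam_dir": 1, "ham_dir": 0})` (both directory names are placeholders), split the result, fit `TinyNaiveBayes` on the training part, and compare `predict` on the test part against the held-out labels with `accuracy`. Calling `Counter(labels)` on the loaded labels is a quick way to confirm that the classes are roughly balanced before relying on accuracy alone.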
