Loading the dataset

If you downloaded the latest code from GitHub, you will find a number of .tar.gz archives in the notebooks/data/chapter7 directory. Each archive contains raw email data (with fields for To:, Cc:, and the text body), and every email is classified either as spam (class label SPAM = 1) or as not spam, also known as ham (class label HAM = 0).

We build a list called sources, which pairs each raw data file with its class label:

In [1]: HAM = 0
...     SPAM = 1
...     datadir = 'data/chapter7'
...     sources = [
...        ('beck-s.tar.gz', HAM),
...        ('farmer-d.tar.gz', HAM),
...        ('kaminski-v.tar.gz', HAM),
...        ('kitchen-l.tar.gz', HAM),
...        ('lokay-m.tar.gz', HAM),
...        ('williams-w3.tar.gz', HAM),
...        ('BG.tar.gz', SPAM),
...        ('GP.tar.gz', SPAM),
...        ('SH.tar.gz', SPAM)
...     ]

The ...
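As a rough illustration of how a list like sources could be consumed, the following sketch reads every regular file from each archive and collects the email text together with its label. It relies only on the standard tarfile module. The helper names read_archive and load_sources are hypothetical, and decoding the raw bytes as Latin-1 is an assumption made so that no byte sequence raises an error; this is one possible approach, not necessarily the one the book goes on to use.

```python
import os
import tarfile

HAM = 0
SPAM = 1


def read_archive(path, label):
    """Yield (text, label) for every regular file in a .tar.gz archive.

    Hypothetical helper: directories and other non-file members are skipped.
    """
    with tarfile.open(path, 'r:gz') as tar:
        for member in tar:
            if not member.isfile():
                continue
            raw = tar.extractfile(member).read()
            # Assumption: Latin-1 maps every byte to a character, so
            # decoding never fails, even on oddly encoded emails.
            yield raw.decode('latin-1'), label


def load_sources(datadir, sources):
    """Collect texts and labels from a list of (filename, label) pairs."""
    texts, labels = [], []
    for filename, label in sources:
        archive_path = os.path.join(datadir, filename)
        for text, lab in read_archive(archive_path, label):
            texts.append(text)
            labels.append(lab)
    return texts, labels
```

With the variables defined earlier, calling load_sources(datadir, sources) would return two parallel lists: the raw email texts and their HAM or SPAM labels.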
