Loading the dataset

If you downloaded the latest code from GitHub, you will find a number of .tar.gz archives in the notebooks/data/chapter7 directory. These archives contain raw email data (with fields for To:, Cc:, and the text body), each classified either as spam (class label SPAM = 1) or as not spam (also known as ham, class label HAM = 0).

We build a variable called sources, which pairs each raw data file with its class label:

In [1]: HAM = 0
...     SPAM = 1
...     datadir = 'data/chapter7'
...     sources = [
...        ('beck-s.tar.gz', HAM),
...        ('farmer-d.tar.gz', HAM),
...        ('kaminski-v.tar.gz', HAM),
...        ('kitchen-l.tar.gz', HAM),
...        ('lokay-m.tar.gz', HAM),
...        ('williams-w3.tar.gz', HAM),
...        ('BG.tar.gz', SPAM),
...        ('GP.tar.gz', SPAM),
...        ('SH.tar.gz', SPAM)
...     ]
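To make the role of the (filename, label) pairs concrete, here is a minimal sketch of one way such a list could be consumed. The helper names read_archive and load_sources are hypothetical and are not taken from the book's code. The sketch also assumes that every regular file inside an archive is a single email and that undecodable bytes can be replaced rather than rejected.

```python
import os
import tarfile

HAM = 0
SPAM = 1


def read_archive(path, label):
    """Return a list of (text, label) pairs, one per regular file in the archive.

    Hypothetical helper: assumes each regular file is one raw email.
    """
    samples = []
    with tarfile.open(path, 'r:gz') as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            fileobj = tar.extractfile(member)
            # Raw email can contain bytes that are not valid UTF-8,
            # so decode permissively instead of raising an error.
            text = fileobj.read().decode('utf-8', errors='replace')
            samples.append((text, label))
    return samples


def load_sources(datadir, sources):
    """Read every archive listed in `sources` that exists under `datadir`."""
    data = []
    for filename, label in sources:
        path = os.path.join(datadir, filename)
        if not os.path.isfile(path):
            print('skipping missing archive:', path)
            continue
        data.extend(read_archive(path, label))
    return data
```

With the variables defined above, a call such as load_sources(datadir, sources) would return one (text, label) pair per email, with the label inherited from the archive the email came from. Reading members directly from the archive avoids unpacking thousands of small files to disk, though extracting them first is an equally reasonable choice.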

The ...
