Follow these steps to load the dataset:
- If you downloaded the latest code from GitHub, you will find several .tar.gz archives in the notebooks/data/chapter7 directory. Each archive contains raw email data (with fields for To:, Cc:, and the text body) and is labeled either as spam (class label SPAM = 1) or as not spam, also known as ham (class label HAM = 0).
- We create a variable called sources, which lists every raw data file together with its class label:
In [1]: HAM = 0
...     SPAM = 1
...     datadir = 'data/chapter7'
...     sources = [
...         ('beck-s.tar.gz', HAM),
...         ('farmer-d.tar.gz', HAM),
...         ('kaminski-v.tar.gz', HAM),
...         ('kitchen-l.tar.gz', HAM),
...         ('lokay-m.tar.gz', HAM),
...         ('williams-w3.tar.gz', HAM),
...         ('BG.tar.gz', SPAM),
...         ('GP.tar.gz', ...
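
A list of (filename, label) pairs like sources could be consumed roughly as follows. This is only a minimal sketch using Python's standard tarfile module, not necessarily the approach taken in the rest of the chapter. The helper names read_archive, load_sources, and make_demo_archive are hypothetical, and the two tiny demo archives are created on the fly so the example runs without the real data files.

```python
import io
import os
import tarfile
import tempfile

HAM = 0
SPAM = 1


def read_archive(path, label):
    """Yield (text, label) for every regular file in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = tar.extractfile(member).read()
            # Assumption: latin-1 is used because it can decode any byte sequence.
            yield raw.decode("latin-1"), label


def load_sources(datadir, sources):
    """Collect texts and labels from a list of (filename, label) pairs."""
    texts, labels = [], []
    for filename, label in sources:
        path = os.path.join(datadir, filename)
        for text, lab in read_archive(path, label):
            texts.append(text)
            labels.append(lab)
    return texts, labels


def make_demo_archive(path, messages):
    """Write a tiny .tar.gz with one member per message (demo data only)."""
    with tarfile.open(path, "w:gz") as tar:
        for i, body in enumerate(messages):
            data = body.encode("latin-1")
            info = tarfile.TarInfo(name=f"msg_{i}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


# Hypothetical stand-ins for the real archives in data/chapter7.
demo_dir = tempfile.mkdtemp()
make_demo_archive(os.path.join(demo_dir, "demo-ham.tar.gz"),
                  ["See you at 10.", "Minutes attached."])
make_demo_archive(os.path.join(demo_dir, "demo-spam.tar.gz"),
                  ["WIN CASH NOW"])

demo_sources = [("demo-ham.tar.gz", HAM), ("demo-spam.tar.gz", SPAM)]
texts, labels = load_sources(demo_dir, demo_sources)
print(len(texts), labels)
```

Keeping the label next to the filename means every message inherits the class of the archive it came from, so no per-message annotation is needed.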