Loading the dataset

Follow these steps to load the dataset:

  1. If you downloaded the latest code from GitHub, you will find several .tar.gz archives in the notebooks/data/chapter7 directory. These files contain raw email data (with fields for To:, Cc:, and the text body), and each archive is classified either as spam (class label SPAM = 1) or as not spam, also known as ham (class label HAM = 0).
  2. We build a list called sources, which pairs each raw data file with its class label:
In [1]: HAM = 0
...     SPAM = 1
...     datadir = 'data/chapter7'
...     sources = [
...        ('beck-s.tar.gz', HAM),
...        ('farmer-d.tar.gz', HAM),
...        ('kaminski-v.tar.gz', HAM),
...        ('kitchen-l.tar.gz', HAM),
...        ('lokay-m.tar.gz', HAM),
...        ('williams-w3.tar.gz', HAM),
...        ('BG.tar.gz', SPAM),
...        ('GP.tar.gz', ...
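
The excerpt ends before it shows how the archives are read, so the following is only a minimal sketch of one way a list of (filename, label) tuples like sources could be turned into texts and labels using Python's standard tarfile module. The helper names read_archive and load_sources are hypothetical, and the latin-1 decoding is an assumption about the email files, not something the text states.

```python
import os
import tarfile

HAM = 0
SPAM = 1


def read_archive(path, label, encoding="latin-1"):
    """Yield (text, label) for every regular file in a .tar.gz archive.

    Hypothetical helper: the encoding is an assumption, chosen because
    latin-1 can decode any byte sequence without raising an error.
    """
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and special entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                text = handle.read().decode(encoding, errors="replace")
            yield text, label


def load_sources(datadir, sources):
    """Collect texts and labels from a list of (filename, label) tuples."""
    texts, labels = [], []
    for filename, label in sources:
        path = os.path.join(datadir, filename)
        for text, lab in read_archive(path, label):
            texts.append(text)
            labels.append(lab)
    return texts, labels
```

With datadir and sources defined as in the listing, a call such as texts, labels = load_sources(datadir, sources) would return two parallel lists, one entry per email file. Reading the members directly from the archive avoids unpacking thousands of small files to disk, at the cost of re-reading the archives on every run.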
