Having worked through a spam email detection example by hand, we will now, as promised, code it on a real dataset, taken from the Enron email dataset http://www.aueb.gr/users/ion/data/enron-spam/. The specific dataset we are using can be downloaded directly from http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz. You can either unzip it with an archive tool, or run the following command in your terminal:
tar -xvzf enron1.tar.gz
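If you would rather stay in Python, the same extraction can be sketched with the standard library's tarfile module. This is a minimal sketch, assuming the archive was saved as enron1.tar.gz in the current working directory; the helper name extract_archive is ours for illustration and is not part of the original example.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the sorted list of member names found in the archive.
    (Hypothetical helper, named here for illustration only.)
    """
    with tarfile.open(str(archive_path), "r:gz") as tar:
        names = sorted(tar.getnames())
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            tar.extractall(path=str(dest_dir), filter="data")
        else:
            tar.extractall(path=str(dest_dir))
    return names


if __name__ == "__main__":
    # Assumption: the download was saved under this name in the current folder.
    archive = Path("enron1.tar.gz")
    if archive.exists():
        members = extract_archive(archive)
        print("Extracted", len(members), "entries")
```

Checking for the archive before extracting keeps the script from failing if the download has not finished or was saved elsewhere.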
The uncompressed folder contains a folder of ham (non-spam) email text files, a folder of spam email text files, and a summary description of the dataset:
enron1/
    ham/
        0001.1999-12-10.farmer.ham.txt
        0002.1999-12-13.farmer.ham.txt
        ……
        ……
        5172.2002-01-11.farmer.ham.txt
    ...