Having worked through a spam email detection example by hand, we will now, as promised, code it against a real dataset, the Enron email dataset (http://www.aueb.gr/users/ion/data/enron-spam/). The specific dataset we are using can be downloaded directly from http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz. You can extract it either with an archive utility or by running the command tar -xvzf enron1.tar.gz in the Terminal. The uncompressed folder includes a folder of ham email text files and a folder of spam email text files, as well as a summary description of the dataset:
enron1/
    ham/
        0001.1999-12-10.farmer.ham.txt
        0002.1999-12-13.farmer.ham.txt
        ......
        ......
        5172.2002-01-11.farmer.ham.txt
    ...
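If you would rather stay in Python than switch to the Terminal, the extraction step can be sketched with the standard library's tarfile module. This is a minimal sketch, not part of the original walkthrough: the helper names extract_archive and count_text_files are hypothetical, the archive is assumed to sit in the current working directory, and the spam/ folder name is assumed from the description above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the sorted top-level names found in the archive
    (hypothetical helper, for illustration only).
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)
        return sorted({member.name.split("/")[0] for member in tar.getmembers()})


def count_text_files(folder):
    """Count the .txt files directly inside folder (hypothetical helper)."""
    return sum(1 for _ in Path(folder).glob("*.txt"))


# Assumes enron1.tar.gz has already been downloaded to the working directory.
if __name__ == "__main__" and Path("enron1.tar.gz").exists():
    print(extract_archive("enron1.tar.gz"))
    # Folder names assumed from the description: ham/ and spam/ under enron1/.
    print("ham files:", count_text_files("enron1/ham"))
    print("spam files:", count_text_files("enron1/spam"))
```

Counting the files in each folder right after extraction is a quick sanity check that the archive unpacked completely before any emails are read.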