Data preparation

Now that we have clearly defined the problem that we are going to solve with ML, we need data. No data, no ML. Typically, you need an extra step before data preparation to collect and gather the data that you need, but in this book we are going to use a pre-compiled and labeled dataset that is publicly available. In this chapter, we are going to use the CSDMC2010 SPAM corpus dataset to train and test our models. You can download the compressed data from the bottom of the dataset's web page. When you have downloaded and decompressed the data, you will see two folders named TESTING and TRAINING, and a text file named SPAMTrain.label ...
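Before going any further, it can help to confirm that the decompressed data looks the way we expect. The following is a minimal sketch in Python, not part of the dataset or of any library: the function name missing_entries and the root path passed to it are hypothetical, and the only things it checks are the two folders and the label file named above.

```python
from pathlib import Path

# Entries the text says should appear after decompressing the download.
EXPECTED_DIRS = ("TESTING", "TRAINING")
EXPECTED_FILES = ("SPAMTrain.label",)


def missing_entries(root):
    """Return the names of expected folders/files not found under root.

    `root` is wherever you decompressed the data (a hypothetical,
    user-chosen location). An empty list means the layout matches.
    """
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling missing_entries with the folder you extracted the archive into should return an empty list; any names it returns point to a download or decompression step that did not complete.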

