July 2017
Our first problem is a modern version of the canonical binary classification problem: spam filtering. In our version, however, we will classify spam and ham SMS messages rather than e-mail. We will extract tf-idf features from the messages using the techniques we learned in previous chapters, and classify the messages using logistic regression. We will use the SMS Spam Collection Data Set from the UCI Machine Learning Repository. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. First, let's explore the dataset and calculate some basic summary statistics using pandas:
# In[1]:
import pandas as pd

# The file has no header row; fields are separated by a tab character
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
print(df.head())

# Out[1]: ...
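To make the workflow described above concrete without downloading the file, the following is a minimal sketch that runs on a handful of invented messages. It assumes each row holds a label (`ham` or `spam`) followed by a tab and the message text. The sample rows and the column names `label` and `message` are hypothetical, and scikit-learn's `TfidfVectorizer` and `LogisticRegression` are used with their default settings, which may differ from the configuration used later in the chapter.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in rows in the assumed "label<TAB>message" layout.
# They are invented for illustration and are not taken from the dataset.
SAMPLE = (
    "ham\tAre we still meeting for lunch today?\n"
    "ham\tI will call you when I get home\n"
    "ham\tCan you pick up some milk on the way back?\n"
    "ham\tSee you at the station at six\n"
    "spam\tWINNER! Claim your free prize now, text WIN to 80000\n"
    "spam\tFree entry in a weekly draw, reply YES to claim cash\n"
    "spam\tUrgent: your account has won a free cash bonus, call now\n"
    "spam\tClaim your free ringtone now, text TONE to 80000\n"
)

# Read the tab-separated text the same way as the real file,
# but give the two columns names (an assumption made for readability).
df = pd.read_csv(
    io.StringIO(SAMPLE),
    delimiter='\t',
    header=None,
    names=['label', 'message'],
)

# Basic summary statistic: number of messages per class.
counts = df['label'].value_counts()
print(counts)

# Extract tf-idf features and fit a logistic regression classifier.
# The pipeline vectorizes the raw text before passing it to the model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(df['message'], df['label'])

# Classify two new, equally hypothetical messages.
new_messages = [
    'Claim your free cash prize now',
    'Are you home for lunch?',
]
predictions = model.predict(new_messages)
for text, label in zip(new_messages, predictions):
    print(label, '<-', text)
```

With only eight training rows, the predictions say nothing about how well the approach works; a meaningful evaluation needs the full dataset and a held-out test set.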