Feature engineering for email data

We briefly looked at the word distributions for spam and ham emails in the previous step, and we noticed a couple of things. First, many of the most frequently occurring words are common words that carry little meaning. For example, words like "to", "the", "for", and "a" appear in almost any text, so our ML algorithms would not learn much from them. These types of words are called stop words, and they are often ignored or dropped from the feature set. We will use NLTK's list of stop words to filter out commonly used words from our feature set. You can download the NLTK list of stop words from here: https://github.com/yoonhwang/c-sharp-machine-learning/blob/master/ch.2/stopwords.txt. One way to ...
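To make the idea of stop-word filtering concrete, here is a minimal sketch. It is written in Python for brevity and is not the book's own implementation. It assumes the stop-word file holds one word per line, and the helper names (`load_stop_words`, `remove_stop_words`), the four-word sample list, and the sample tokens are all made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def load_stop_words(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase stop-word set from lines of text.

    Assumption: one stop word per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Tiny illustrative sample. In practice the lines would come from the
# downloaded file, e.g. open("stopwords.txt", encoding="utf-8").
sample_stop_words = load_stop_words(["to", "the", "for", "a"])

# Hypothetical tokenized email subject, invented for this example.
tokens = ["Claim", "the", "prize", "reserved", "for", "you", "to", "win", "a", "prize"]

# Drop the stop words, then count what is left (lowercased).
filtered = remove_stop_words(tokens, sample_stop_words)
word_counts = Counter(t.lower() for t in filtered)

print(filtered)
print(word_counts.most_common(3))
```

Keeping the stop words in a set makes each lookup cheap, and lowercasing both sides means that "The" and "the" are treated as the same word. The same pattern carries over to other languages with a hash-set type.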
