The algorithm

We use the list_words() function to get a list of the unique lowercase words in a text that are longer than three characters:

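def list_words(text):
    # Collect each distinct lowercase word longer than three characters,
    # preserving the order of first appearance.
    words = []
    words_tmp = text.lower().split()
    for w in words_tmp:
        if w not in words and len(w) > 3:
            words.append(w)
    return words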
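
As a quick check of what list_words() returns, the sketch below applies it to a made-up sentence (the sample text is an assumption, not data from the book). It also shows a side effect of splitting on whitespace only: punctuation stays attached to a word, so "offer!" and "offer" are treated as different words.

```python
def list_words(text):
    # Same logic as the function above, repeated so the example runs on its own.
    words = []
    for w in text.lower().split():
        if w not in words and len(w) > 3:
            words.append(w)
    return words


# Hypothetical sample text, chosen only for illustration.
sample = "Spam spam offer! Free offer now"
result = list_words(sample)
print(result)  # ['spam', 'offer!', 'free', 'offer']
```

The word "now" is dropped because it has only three characters, and the second "spam" is dropped as a duplicate after lowercasing.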

Tip

For a more advanced term-document matrix, we can use the Python textmining package from:

https://pypi.python.org/pypi/textmining/1.0

The training() function creates the variables that store the data needed for classification. The c_words variable is a dictionary of the unique words and their number of occurrences in the text (frequency) by category. The c_categories variable is a dictionary that maps each category to its number of texts. Finally, c_text and c_total_words store the ...
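
To make the two dictionaries described above concrete, here is a minimal sketch of how they could be filled. It is not the book's training() function: the name training_sketch, the input format (a list of (text, category) pairs), and the nested layout of c_words are all assumptions. The c_text and c_total_words variables are left out because their description is cut off in this excerpt.

```python
def list_words(text):
    # Unique lowercase words longer than three characters, as defined earlier.
    words = []
    for w in text.lower().split():
        if w not in words and len(w) > 3:
            words.append(w)
    return words


def training_sketch(texts):
    """Hypothetical helper; texts is an assumed list of (text, category) pairs."""
    c_words = {}       # assumed layout: category -> {word: count}
    c_categories = {}  # category -> number of texts seen
    for text, category in texts:
        c_categories[category] = c_categories.get(category, 0) + 1
        word_counts = c_words.setdefault(category, {})
        for w in list_words(text):
            word_counts[w] = word_counts.get(w, 0) + 1
    return c_words, c_categories


# Made-up training data, for illustration only.
data = [
    ("Free offer click here", "spam"),
    ("Meeting notes attached here", "ham"),
    ("Free prize offer", "spam"),
]
c_words, c_categories = training_sketch(data)
print(c_categories)
print(c_words["spam"])
```

Because list_words() returns each word once per text, the counts in this sketch are the number of texts in a category that contain the word, not the raw number of occurrences.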


Practical Data Analysis - Second Edition by Hector Cuesta, Dr. Sampath Kumar
