Training Naive Bayes

Now that we have extracted the blog posts, we can train our Naive Bayes model on them. The intuition is that, for each word, we record the probability of that word being written by a particular gender and store these values in our model. To classify a new sample, we multiply the probabilities of its words for each gender and choose the most likely gender.
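To make the intuition concrete, here is a minimal in-memory sketch. It is not the code this chapter builds; the function names train and classify, the small fallback probability for unseen words, and the choice to define a word's frequency as its count divided by the total words for that gender are all assumptions made for illustration. It also ignores the prior probability of each gender and sums logarithms instead of multiplying raw probabilities, which gives the same ranking while avoiding numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Build a word -> {gender: frequency} mapping.

    samples is an iterable of (gender, list_of_words) pairs.
    Frequency here is assumed to mean: occurrences of the word for a
    gender divided by the total number of words seen for that gender.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for gender, words in samples:
        counts[gender].update(words)
        totals[gender] += len(words)

    model = {}
    for gender, counter in counts.items():
        for word, n in counter.items():
            model.setdefault(word, {})[gender] = n / totals[gender]
    return model


def classify(model, words, genders=("male", "female"), unseen=1e-5):
    """Return the gender with the highest summed log-probability.

    unseen is a hypothetical fallback probability for words that were
    never observed for a gender, so that log(0) is never taken.
    """
    scores = {}
    for gender in genders:
        scores[gender] = sum(
            math.log(model.get(word, {}).get(gender, unseen))
            for word in words
        )
    return max(scores, key=scores.get)
```

With a toy corpus such as [("male", ["car", "game", "car"]), ("female", ["garden", "book", "book"])], calling train and then classify(model, ["car"]) picks "male", because "car" was only ever seen in the male sample.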

The aim of this code is to output a file that lists each word in the corpus, along with the frequencies of that word for each gender. The output file will look something like this:

"'ailleurs" {"female": 0.003205128205128205}
"'air" {"female": 0.003205128205128205}
"'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274}
"'angoisse" {"female": 0.003205128205128205}
"'apprendra" {"male": 0.0013047113868622459, ...
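Each record in this output pairs a JSON-encoded word with a JSON object of per-gender frequencies. As a rough sketch of reading one record back, the snippet below assumes that each record sits on its own line and that the word and the frequencies are separated by a tab; both are assumptions about the file layout, and parse_record is a hypothetical helper name.

```python
import json


def parse_record(line):
    """Split one output line into (word, {gender: frequency}).

    Assumes the layout: JSON string, a tab, then a JSON object.
    """
    word_json, freqs_json = line.rstrip("\n").split("\t", 1)
    return json.loads(word_json), json.loads(freqs_json)


# One record from the sample output, written with an assumed tab separator.
line = '"\'an"\t{"male": 0.0030581039755351682, "female": 0.004273504273504274}'
word, freqs = parse_record(line)
```

A gender missing from the object, as with "'air" in the sample, means that no frequency was recorded for that word and gender.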
