Now that we have extracted the blog posts, we can train our Naive Bayes model on them. The intuition is that we estimate, for each word, the probability that it was written by a particular gender, and store these values in our model. To classify a new sample, we multiply the probabilities of its words for each gender and choose the gender with the highest result.
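As a rough illustration of the classification step, here is a minimal sketch. It assumes the model is a dictionary mapping each word to its per-gender values, that both genders are equally likely a priori, and that a word missing for a gender gets a small floor value; the function name `classify` and the `floor` parameter are hypothetical. It sums logarithms instead of multiplying raw probabilities, which gives the same ranking while avoiding numerical underflow on long posts.

```python
import math

def classify(words, model, genders=("male", "female"), floor=1e-5):
    """Return the gender with the highest summed log-probability.

    `floor` is an assumed stand-in value for words that were never
    seen for a given gender; it is not taken from the original text.
    """
    scores = {}
    for gender in genders:
        # Summing logs ranks the genders the same way as multiplying
        # the probabilities, but does not underflow on long inputs.
        scores[gender] = sum(
            math.log(model[word].get(gender, floor))
            for word in words
            if word in model  # skip words that are not in the model at all
        )
    return max(scores, key=scores.get)

# Two entries in the shape of the output file described in this section
model = {
    "'ailleurs": {"female": 0.003205128205128205},
    "'an": {"male": 0.0030581039755351682, "female": 0.004273504273504274},
}
print(classify(["'ailleurs", "'an"], model))  # prints "female"
```

Both words have a higher value for "female" than for "male" (the first has no "male" value at all, so the floor applies), which is why the sketch picks "female" here.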
The training code should output a file that lists each word in the corpus, along with the frequency of that word for each gender. The output file will look something like this:
"'ailleurs" {"female": 0.003205128205128205}
"'air" {"female": 0.003205128205128205}
"'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274}
"'angoisse" {"female": 0.003205128205128205}
"'apprendra" {"male": 0.0013047113868622459, ...
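To make the file format concrete, here is a small sketch of how such output could be read back into a dictionary. It assumes one entry per line, with each line holding a JSON string (the word) followed by a JSON object (the per-gender values), separated by whitespace; the helper name `load_model` is hypothetical.

```python
import json

def load_model(lines):
    """Parse lines of the form: "word" {"gender": value, ...}"""
    decoder = json.JSONDecoder()
    model = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # raw_decode parses the first JSON value (the word) and returns
        # the index where it ends; the remainder is the per-gender object.
        word, end = decoder.raw_decode(line)
        model[word] = json.loads(line[end:])
    return model

# Two complete entries copied from the sample output above
sample = [
    '"\'ailleurs" {"female": 0.003205128205128205}',
    '"\'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274}',
]
model = load_model(sample)
print(sorted(model["'an"]))
```

In practice the same function could be passed an open file object instead of a list, since it only iterates over lines.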