As we are looking for authorship information, we only want the e-mails we can attribute to a specific author. For that reason, we will look in each user's sent folder, which holds the e-mails they have sent. We can now create a function that will choose a couple of authors at random and return each of the e-mails in their sent folder. Specifically, we are looking for the payloads, that is, the body text of each message rather than the full e-mail with its headers. For that, we will need an e-mail parser. The code is as follows:
from email.parser import Parser
p = Parser()
We will be using this later to extract the payloads from the e-mail files that are in the data folder.
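To make the idea of a payload concrete, here is a minimal sketch of how the parser separates headers from body text. The message in `raw_message` is made up for illustration and is not taken from the dataset; in practice the string would be read from one of the files in a sent folder.

```python
from email.parser import Parser

p = Parser()

# Hypothetical raw message text, standing in for the contents of one e-mail file
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "Let's meet at 10am tomorrow.\n"
)

# parsestr() builds a Message object from a string
message = p.parsestr(raw_message)

# Headers are available by name; the payload is the body text
sender = message["From"]
subject = message["Subject"]
payload = message.get_payload()
```

For a simple, non-multipart message like this one, `get_payload()` returns the body as a string. Multipart messages return a list of sub-messages instead, so real data may need an extra check.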
With our data loading function, we are going to have a lot of options. Most of ...