Creating a dataset loader

Because we are looking for authorship information, we only want the e-mails we can attribute to a specific author. For that reason, we will look in each user's sent folder, which holds the e-mails they have sent. We can now create a function that will choose a couple of authors at random and return each of the e-mails in their sent folder. Specifically, we are looking for the payloads, that is, the body content rather than the full e-mail messages with their headers. For that, we will need an e-mail parser. The code is as follows:

from email.parser import Parser
p = Parser()

We will be using this later to extract the payloads from the e-mail files that are in the data folder.
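As a rough illustration of how the parser and the sent-folder idea fit together, here is a minimal sketch. It is not the book's loading function: the names extract_payload and load_sent_payloads, the folder layout data_folder/<author>/sent/, the seed parameter, and the cp1252 file encoding are all assumptions made for this example.

```python
import os
import random
from email.parser import Parser

p = Parser()


def extract_payload(raw_text):
    """Parse raw e-mail text and return its payload (the body content).

    For a simple, non-multipart message this is a string; a multipart
    message would return a list of sub-messages instead.
    """
    message = p.parsestr(raw_text)
    return message.get_payload()


def load_sent_payloads(data_folder, num_authors=2, seed=None):
    """Return (payloads, authors) for a few randomly chosen authors.

    Assumes a hypothetical layout: data_folder/<author>/sent/<e-mail file>.
    """
    rng = random.Random(seed)
    # Keep only the authors that actually have a sent folder
    candidates = sorted(
        name for name in os.listdir(data_folder)
        if os.path.isdir(os.path.join(data_folder, name, "sent"))
    )
    chosen = rng.sample(candidates, min(num_authors, len(candidates)))

    payloads, authors = [], []
    for author in chosen:
        sent_folder = os.path.join(data_folder, author, "sent")
        for filename in sorted(os.listdir(sent_folder)):
            path = os.path.join(sent_folder, filename)
            if not os.path.isfile(path):
                continue  # skip nested folders
            # The encoding is an assumption; undecodable bytes are replaced
            with open(path, encoding="cp1252", errors="replace") as f:
                payloads.append(extract_payload(f.read()))
            authors.append(author)
    return payloads, authors
```

Calling load_sent_payloads("path/to/data", num_authors=2) would return two parallel lists: one payload per e-mail, and the author each payload is attributed to. Passing a fixed seed makes the random choice of authors repeatable.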

With our data loading function, we are going to have a lot of options. Most of ...

