Chapter 2. Data

This chapter introduces the dataset we will work on in the rest of the book: your own email inbox. It also covers the kinds of tools we’ll be using and our reasons for choosing them. Finally, it outlines several perspectives on analyzing data for you to keep in mind moving forward.

The book starts with data because in Agile Big Data, our process starts with the data.


If you do not have a Gmail account, you will need to create one and populate it with some email messages in order to complete the exercises in this chapter.


Email is a fundamental part of the Internet. More than that, it is foundational, forming the basis for authentication for the Web and social networks. In addition to being abundant and well understood, email is complex, is rich in signal, and yields interesting information when mined.

We will use your own email inbox as the dataset for the application we’ll develop, which keeps the examples relevant. By downloading your Gmail inbox and then using it in the examples, we will immediately face a “big” (or, actually, a “medium”) data problem: processing the data on your local machine is just barely feasible. Working with data too large to fit in RAM requires scalable tools, which is helpful as a learning device. By using your own email inbox, we’ll enable insights into your own little world, helping you see which techniques are effective! This is cultivating data intuition, a ...
