Chapter 5. Processing Large Datasets

The most popular social websites have hundreds of millions to more than a billion users. Logging the actions that users take, such as messaging, adding and removing network relationships, and clicking, rapidly amounts to terabytes (one thousand gigabytes) or even petabytes (one million gigabytes) of data. To make a ballpark estimate, with 10 million users spending 10 minutes a day on your site and viewing 10 items per minute, you can expect to record approximately one billion items a day in the logs. If every item takes approximately 1,000 bytes to record, you can expect approximately 1 terabyte (TB) of data per day. This rough estimate may even be low by an order of magnitude or more once you consider server system logs as well. Clearly, any analysis that keeps up with such a rate of incoming data requires considerable computation.
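To make the arithmetic behind this estimate explicit, here is a minimal back-of-the-envelope sketch in Python; the user count, session length, view rate, and per-record size are the assumed figures from the paragraph above, not measurements:

    # Back-of-the-envelope estimate of daily log volume (assumed figures).
    users = 10_000_000          # 10 million active users
    minutes_per_user = 10       # minutes spent on the site per day
    items_per_minute = 10       # items viewed per minute
    bytes_per_item = 1000       # approximate size of one log record

    items_per_day = users * minutes_per_user * items_per_minute
    bytes_per_day = items_per_day * bytes_per_item

    print(f"{items_per_day:,} items/day")        # 1,000,000,000 items/day
    print(f"{bytes_per_day / 1e12:.1f} TB/day")  # 1.0 TB/day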

A single processor cannot keep up with the flow of Internet-scale social data. Analysis must rely on sampling, approximate, or parallel algorithms. This chapter focuses on MapReduce, a model of parallel computation that lets you structure your problems so that hundreds or even thousands of computers can process your data faster than it accumulates. The chapter also introduces algorithms that carry out computations with fewer resources than traditional approaches require, while delivering approximations to the exact results within guaranteed bounds.
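As a preview of the structure MapReduce imposes, the sketch below counts word occurrences with a map function that emits key-value pairs and a reduce function that combines all values sharing a key. This is only an illustrative single-process driver standing in for the distributed framework discussed in the rest of the chapter; the function names are the author's own, not part of any particular MapReduce library:

    from collections import defaultdict

    def map_phase(record):
        # Emit a (word, 1) pair for every word in one input record.
        for word in record.split():
            yield word, 1

    def reduce_phase(word, counts):
        # Combine all values emitted for the same key.
        return word, sum(counts)

    def run(records):
        # Single-process stand-in for the framework: the "shuffle" step
        # groups intermediate pairs by key before reduction.
        groups = defaultdict(list)
        for record in records:
            for key, value in map_phase(record):
                groups[key].append(value)
        return dict(reduce_phase(k, v) for k, v in groups.items())

    print(run(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

In a real MapReduce system, the map calls run in parallel on many machines, the framework performs the grouping by key, and the reduce calls run in parallel as well; the program itself only supplies the two functions.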

MapReduce: Structuring Parallel ...
