Chapter 15. Big data and MapReduce


This chapter covers
  • MapReduce
  • Using Python with Hadoop Streaming
  • Automating MapReduce with mrjob
  • Training support vector machines in parallel with the Pegasos algorithm


I often hear, “Your examples are nice, but my data is big, man!” I have no doubt that you work with data sets larger than the examples used in this book. With so many devices connected to the internet and so many people interested in making data-driven decisions, the amount of data we’re collecting has outpaced our ability to process it. Fortunately, a number of open source software projects allow us to process large amounts of data. One project, called Hadoop, is a Java framework for distributing data processing to multiple ...
