Chapter 8. Best practices for large data with Apache Streaming and mrjob

This chapter covers

  • Using JSON to transfer complex data structures between Apache Streaming steps
  • Writing mrjob scripts to interact with Hadoop without Apache Streaming
  • Thinking about mappers and reducers as key-value consumers and producers
  • Analyzing web traffic logs and tennis match logs with Apache Hadoop

In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we’ll dive deep into Hadoop, the Java-based framework for processing large datasets. As we touched on in the last chapter, Hadoop has a lot of benefits. We can use Hadoop to process

  • lots of data fast—distributed parallelization
  • data that’s important—low ...

