Chapter 8. Best practices for large data with Hadoop Streaming and mrjob

This chapter covers

  • Using JSON to transfer complex data structures between Hadoop Streaming steps
  • Writing mrjob scripts to interact with Hadoop without writing Hadoop Streaming scripts directly
  • Thinking about mappers and reducers as key-value consumers and producers
  • Analyzing web traffic logs and tennis match logs with Apache Hadoop

In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we’ll dive deep into Hadoop, the Java-based framework for processing large datasets. As we touched on in the last chapter, Hadoop has a lot of benefits. We can use Hadoop to process

  • lots of data fast—distributed parallelization
  • data that’s important—low ...
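Before digging into the chapter, it helps to see the whole key-value flow that Hadoop orchestrates for us: map each record to key-value pairs, shuffle and sort so equal keys land together, then reduce each group. The sketch below simulates that flow locally with `sorted` standing in for the shuffle, counting hits per IP; the tiny log and its format are illustrative, not data from the book:

```python
import itertools
import json

def mapper(lines):
    # Emit (key, JSON value) pairs, one per input record
    for line in lines:
        ip = line.split()[0]
        yield ip, json.dumps({"hits": 1})

def reducer(pairs):
    # Hadoop delivers pairs grouped by key after the shuffle/sort;
    # itertools.groupby stands in for that grouping here
    for ip, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        total = sum(json.loads(value)["hits"] for _, value in group)
        yield ip, total

log = ["1.1.1.1 - /a", "2.2.2.2 - /b", "1.1.1.1 - /c"]
# sorted() plays the role of Hadoop's shuffle between map and reduce
counts = dict(reducer(sorted(mapper(log))))
```

On a real cluster the mapper and reducer run as separate processes on many machines, but the contract is exactly this: both sides are consumers and producers of key-value pairs.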
