Chapter 8. Best practices for large data with Hadoop Streaming and mrjob

This chapter covers

  • Using JSON to transfer complex data structures between Hadoop Streaming steps
  • Writing mrjob scripts to interact with Hadoop without writing Hadoop Streaming scripts directly
  • Thinking about mappers and reducers as key-value consumers and producers
  • Analyzing web traffic logs and tennis match logs with Apache Hadoop

In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we’ll dive deep into Hadoop, the Java-based framework for processing large datasets. As we touched on in the last chapter, Hadoop has a lot of benefits. We can use Hadoop to process

  • lots of data fast—distributed parallelization
  • data that’s important—low ...
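Before digging into the chapter, it helps to see the whole key-value flow that Hadoop orchestrates for us: map each record to key-value pairs, shuffle and sort so equal keys land together, then reduce each group. The sketch below simulates that flow locally with `sorted` standing in for the shuffle, counting hits per IP; the tiny log and its format are illustrative, not data from the book:

```python
import itertools
import json

def mapper(lines):
    # Emit (key, JSON value) pairs, one per input record
    for line in lines:
        ip = line.split()[0]
        yield ip, json.dumps({"hits": 1})

def reducer(pairs):
    # Hadoop delivers pairs grouped by key after the shuffle/sort;
    # itertools.groupby stands in for that grouping here
    for ip, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        total = sum(json.loads(value)["hits"] for _, value in group)
        yield ip, total

log = ["1.1.1.1 - /a", "2.2.2.2 - /b", "1.1.1.1 - /c"]
# sorted() plays the role of Hadoop's shuffle between map and reduce
counts = dict(reducer(sorted(mapper(log))))
```

On a real cluster the mapper and reducer run as separate processes on many machines, but the contract is exactly this: both sides are consumers and producers of key-value pairs.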
