Chapter 12. Integrating Hadoop

Jeremy Hanna

As companies and organizations adopt technologies like Cassandra, they look for tools that can be used to perform analytics and queries against their data. Cassandra's built-in query mechanisms, together with custom layers built on top of them, can accomplish a lot. However, there are also distributed tools in the community that can be adapted to work with Cassandra.

Hadoop seems to be the elephant in the room when it comes to open source big data frameworks. Its ecosystem includes an open source MapReduce implementation and higher-level analytics engines built on top of it, such as Pig and Hive. Thanks to members of both the Cassandra and Hadoop communities, Cassandra has gained some significant integration points with Hadoop and its analytics tools.

In this chapter, we explore how Cassandra and Hadoop fit together. First, we give a brief history of the Apache Hadoop project and go into how one can write MapReduce programs against data in Cassandra. From there, we cover integration with higher-level tools built on top of Hadoop: Pig and Hive. Once we have an understanding of these tools, we cover how a Cassandra cluster can be configured to run these analytics in a distributed way. Finally, we share a couple of use cases where Cassandra is being used alongside Hadoop to solve real-world problems.

What Is Hadoop?

If you’re already familiar with Hadoop, you can safely skip this section. If you haven’t had the pleasure, Hadoop (http://hadoop.apache.org) is a set of open ...
