Now that you’ve ingested data into your Hadoop cluster, what’s next? Usually you’ll want to start by simply cleansing or transforming your data. This could be as simple as reformatting fields and removing corrupt records, or it could involve all manner of complex aggregation, enrichment, and summarization. Once you’ve cleaned up your data, you may be satisfied to simply push it into a more traditional data store, such as a relational database, and consider your big data work to be done. On the other hand, you may want to continue to work with your data, running specialized machine-learning algorithms to categorize it or perhaps performing some sort of geospatial analysis.
In this chapter, we’re going to talk about two types of tools:
General-purpose tools that make it easier to process your data
Focused-purpose libraries that include functionality to make it easier to analyze your data
In the early days of Hadoop, the only way to process the data in your system was to work with MapReduce in Java, but this approach presented a couple of major problems:
The people writing your analytics need to understand not only your business and your data, but also Java code
Pushing a Java archive to Hadoop is more time-consuming than simply authoring a query
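To make the second point concrete, here is a minimal sketch of the map and reduce logic behind the classic word-count job, written in plain Java with no Hadoop dependencies (the class and method names are illustrative, not part of any Hadoop API). Even this stripped-down version shows how much ceremony the Java approach demands compared to a one-line query:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for each word in an input line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "Reduce" phase: after the shuffle groups pairs by key,
    // sum the counts for each word.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(Map.Entry::getKey,
                                              Map.Entry::getValue,
                                              Integer::sum));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello hadoop");
        Map<String, Integer> counts =
            reduce(lines.stream().flatMap(WordCountSketch::map));
        System.out.println(counts);
    }
}
```

A real Hadoop job would additionally require Mapper and Reducer classes, a driver, serializable Writable types, compilation into a JAR, and submission to the cluster, whereas a higher-level query language can express the same computation in a single statement.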
For example, the process of developing and testing a simple analytic written directly in MapReduce might look something ...