Chapter 5. Machine Learning Using EMR

So far we have covered various ways you can use EMR and AWS to analyze log data. The next step in building such a system is to apply machine learning algorithms that make predictions from your data. In the example for this chapter, we’ll use a clustering technique to derive interesting information about the accesses recorded in web log data.

A thorough discussion of machine learning is beyond the scope of this book, but many great resources can help you understand it. Hilary Mason’s An Introduction to Machine Learning with Web Data is a great video course to get started. For a more formal treatment, see the Coursera Machine Learning class taught by Stanford professor Andrew Ng. It is very accessible to most people; you don’t need to be a computer scientist to learn the material.

This chapter will not make you a machine learning expert, but it does present a few examples of how to use machine learning algorithms in EMR. Hopefully, these will pique your interest in learning more about the topic.

A Quick Tour of Machine Learning

What is machine learning? Put simply, machine learning is the application of statistical methods to derive meaning and understanding from information. The clustering algorithm we are going to use in this chapter is called k-Means. k-Means clustering groups a set of data into a number of clusters, k. The exact number ...
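To make the idea of k-Means concrete, here is a minimal sketch in plain Python. It is not the implementation used later in this chapter and it does not run on EMR. The function name kmeans, its parameters, and the sample points are all assumptions made up for illustration. The sketch follows the standard iterative approach (often called Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until nothing changes.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters (illustrative sketch, not production code).

    Returns (centroids, assignments), where assignments[i] is the index of
    the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if not members:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
                continue
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        if new_centroids == centroids:
            break  # No centroid moved, so the clustering has converged.
        centroids = new_centroids
    return centroids, assignments


# Hypothetical sample data: two visibly separate groups of 2-D points.
sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(sample, k=2)
print(centroids)
print(assignments)
```

Note that k is an input here: the caller decides how many clusters to look for, and the algorithm only decides which points belong to which cluster.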
