Skip to Content
Learning Spark
book

Learning Spark

by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
February 2015
Intermediate to advanced
276 pages
7h 18m
English
O'Reilly Media, Inc.
Content preview from Learning Spark

Chapter 11. Machine Learning with MLlib

MLlib is Spark’s library of machine learning functions. Designed to run in parallel on clusters, MLlib contains a variety of learning algorithms and is accessible from all of Spark’s programming languages. This chapter will show you how to call it in your own programs, and offer common usage tips.

Machine learning itself is a topic large enough to fill many books,1 so unfortunately, in this chapter, we will not have the space to explain machine learning in detail. If you are familiar with machine learning, however, this chapter will explain how to use it in Spark; and even if you are new to it, you should be able to combine the material here with other introductory material. This chapter is most relevant to data scientists with a machine learning background looking to use Spark, as well as engineers working with a machine learning expert.

Overview

MLlib’s design and philosophy are simple: it lets you invoke various algorithms on distributed datasets, representing all data as RDDs. MLlib introduces a few data types (e.g., labeled points and vectors), but at the end of the day, it is simply a set of functions to call on RDDs. For example, to use MLlib for a text classification task (e.g., identifying spammy emails), you might do the following:

  1. Start with an RDD of strings representing your messages.

  2. Run one of MLlib’s feature extraction algorithms to convert text into numerical features (suitable for learning algorithms); this will give ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Learning Spark, 2nd Edition

Learning Spark, 2nd Edition

Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
Learning PySpark

Learning PySpark

Tomasz Drabas, Denny Lee
Spark: The Definitive Guide

Spark: The Definitive Guide

Bill Chambers, Matei Zaharia
High Performance Spark, 2nd Edition

High Performance Spark, 2nd Edition

Holden Karau, Adi Polak, Rachel Warren

Publisher Resources

ISBN: 9781449359034Errata PageSupplemental Content