Chapter 3. Understanding the Problem by Understanding the Data

This chapter covers the DataFrame, Dataset, and Resilient Distributed Dataset (RDD) APIs in detail for working with structured data, with the aim of providing a basic understanding of machine learning problems through the available data. By the end of the chapter, you will be able to apply data manipulation, from basic to complex, with ease. We compare the basic abstractions in Spark (RDD, DataFrame, and Dataset based data manipulation) to show the gains in terms of both programming and performance. In addition, we will guide you on the right track so that you will be able to use Spark to persist an RDD or data objects in memory, allowing it to be reused efficiently ...
