Chapter 3. Loading and Preparing Data – DataFrame

In this chapter, we will cover the following recipes:

  • Loading more than 22 features into classes
  • Loading JSON into DataFrames
  • Storing data as Parquet files
  • Using the Avro data model in Parquet
  • Loading from RDBMS
  • Preparing data in DataFrames

Introduction

In previous chapters, we saw how to import data from a CSV file into Breeze and Spark DataFrames. In practice, however, the data to be analyzed almost always arrives in a variety of source formats. Spark's DataFrame API provides a uniform interface that can represent data from any source (or from multiple sources). In this chapter, we'll focus on the various input formats that Spark can load. Towards the end of this chapter, we'll ...
