January 2016
Intermediate to advanced
416 pages
8h 54m
English
A major challenge in data science or engineering is dealing with the wealth of input and output formats for persisting data. We might receive or send data as CSV files, JSON files, or through a SQL database, to name a few.
Spark provides a unified API for serializing DataFrames to, and deserializing them from, a range of data sources.
Spark supports loading data from JSON files, provided that each line in the JSON file corresponds to a single JSON object. Each object will be mapped to a DataFrame row. JSON arrays are mapped to arrays, and embedded objects are mapped to structs.
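To make the one-object-per-line requirement concrete, here is a minimal sketch in plain Python (standard library only, no Spark needed) that writes and re-reads a file in that layout. The records, field names, and helper functions are hypothetical, invented purely for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"login": "alice", "repos": ["spark-demo", "notes"], "owner": {"id": 1, "type": "User"}},
    {"login": "bob", "repos": [], "owner": {"id": 2, "type": "User"}},
]


def write_json_lines(rows, path):
    """Write one compact JSON object per line, with no enclosing array."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # json.dumps without `indent` never emits a newline,
            # so each object stays on a single line.
            f.write(json.dumps(row) + "\n")


def read_json_lines(path):
    """Parse each non-empty line as an independent JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "users.json")
write_json_lines(records, path)
parsed = read_json_lines(path)
```

Assuming PySpark is installed and a `SparkSession` named `spark` already exists, a file like this could then be loaded with `spark.read.json(path)` (older Spark 1.x releases expose the same reader as `sqlContext.read.json`). Following the mapping described above, the `repos` field would become an array column and `owner` a struct.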
This section would be a little dry without some data, so let's generate some from the GitHub API. Unfortunately, the GitHub API does not ...