Semi-structured formats such as JSON typically contain struct, map, and array data types. For example, the request and response payloads of REST web services often carry JSON data with nested fields and arrays.
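To make the nested types concrete, the following is a minimal sketch of how Spark SQL might infer and flatten such a structure. The two sample records, the field names (`id`, `user`, `hashtags`), and the `NestedJsonSketch` object are hypothetical and are not taken from the Twitter file used later; the sketch also assumes Spark 2.2 or later, where `DataFrameReader.json` accepts a `Dataset[String]`. Note that schema inference maps JSON objects to struct types; a map type would normally require an explicitly supplied schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object NestedJsonSketch {

  // Builds a tiny DataFrame from hypothetical tweet-like JSON strings and
  // flattens the nested struct field and the array field into plain columns.
  def flattenSample(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical records: "user" is a nested object, "hashtags" is an array.
    val sample = Seq(
      """{"id": 1, "user": {"name": "alice", "followers": 10}, "hashtags": ["election", "vote"]}""",
      """{"id": 2, "user": {"name": "bob", "followers": 25}, "hashtags": ["debate"]}"""
    ).toDS()

    // Schema inference turns "user" into a struct and "hashtags" into an array.
    val tweets = spark.read.json(sample)
    tweets.printSchema()

    tweets.select(
      col("id"),
      col("user.name").as("userName"),        // dot notation reaches into the struct
      explode(col("hashtags")).as("hashtag")  // one output row per array element
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NestedJsonSketch")
      .master("local[*]")
      .getOrCreate()

    flattenSample(spark).show()
    spark.stop()
  }
}
```

Dot notation and `explode` are only two of several ways to work with nested columns; they are shown here because they keep the flattened result easy to inspect.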
In this section, we will present examples of Spark SQL-based transformations on Twitter data. The input dataset is a file (cache-0.json.gz) containing 10 million tweets, drawn from a collection of over 170 million tweets gathered during the three months leading up to the 2012 US presidential election. This file can be downloaded from https://datahub.io/dataset/twitter-2012-presidential-election.
Before running the following examples, start ZooKeeper and the Kafka broker as described in Chapter 5, ...