We will build this Spark dataframe via simulation, which will take up a good chunk of this chapter. I feel this is a better approach than importing an existing public dataset, where you cannot control the makeup of the data. With a simulated dataset, you are free to size it however you like (subject to account restrictions).
However, you are always free to import any dataset you like; the analytic concepts that follow will be the same.
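To make the idea of a sizable, controllable simulated dataset concrete, here is a minimal sketch in plain Python. It is not the construction used later in this chapter: the function name `simulate_rows` and the columns `id`, `age`, `score`, and `group` are hypothetical, chosen only for illustration. The point is that the row count and the distribution of each column are yours to set.

```python
import random


def simulate_rows(n_rows, seed=42):
    """Return n_rows simulated records as a list of dicts.

    The schema (id, age, score, group) is a hypothetical example,
    not the dataset built later in the chapter.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    rows = []
    for i in range(n_rows):
        rows.append({
            "id": i,
            "age": rng.randint(18, 90),               # uniform integer, inclusive bounds
            "score": round(rng.gauss(50.0, 10.0), 2),  # roughly normal, mean 50, sd 10
            "group": rng.choice(["A", "B", "C"]),      # categorical label
        })
    return rows


# Size the dataset however you like by changing n_rows.
rows = simulate_rows(1000)
print(len(rows))
```

On a cluster, a list like this could then be handed to Spark, for example through PySpark's `createDataFrame`, although that step is an assumption here and is not shown. Fixing the seed means the same data comes back on every run, which helps when comparing results.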
- Preliminaries first: you will need to register and log on to your Databricks account.
- Next, create a cluster. Give it a name, such as MyCluster.
- To conform with the examples in this chapter, make sure you choose Spark 2.1. This is very important. Since Spark is an ...