This section explains how we create our Spark cluster and configure our first dataframe.
- In Spark, we use .master() to specify whether our jobs run on a distributed cluster or locally. In this chapter and the remaining chapters, we execute Spark locally with a single worker thread, as specified by .master('local'). This is fine for testing and development, as we are doing here; however, a single worker thread would likely cause performance issues if we deployed this to production. When more parallelism is needed, it is recommended to use .master('local[*]'), which runs Spark locally with as many worker threads as the machine has logical cores. If we had 3 cores on our machine and we wanted to set our node count to match ...
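The local master settings described above can be sketched as follows. This is a minimal illustration, not the chapter's own code: the helper names local_master and create_session and the app name are hypothetical, and it assumes PySpark is installed when create_session is called. It relies only on the standard local master URL forms: 'local' for one worker thread, 'local[N]' for N threads, and 'local[*]' for one thread per logical core.

```python
def local_master(threads=None):
    """Build a Spark master URL for local mode (hypothetical helper).

    threads=None -> 'local'     (one worker thread)
    threads='*'  -> 'local[*]'  (one thread per logical core)
    threads=N    -> 'local[N]'  (exactly N worker threads)
    """
    if threads is None:
        return "local"
    if threads == "*":
        return "local[*]"
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(threads, int) and not isinstance(threads, bool) and threads > 0:
        return f"local[{threads}]"
    raise ValueError("threads must be None, '*', or a positive integer")


def create_session(app_name, threads=None):
    """Create (or reuse) a local SparkSession with the chosen thread count."""
    # Imported lazily so local_master() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(threads))
        .appName(app_name)  # app_name is an illustrative value
        .getOrCreate()
    )
```

For example, create_session("demo", threads="*") would request one worker thread per logical core, and spark.sparkContext.master can then be inspected to confirm which master the session is using. Note that getOrCreate() returns an already running session if one exists, in which case a different master setting may not take effect.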