The dataset is saved to the same directory that houses the Jupyter notebook for ease of import into the Spark session.
- A local PySpark session is initialized by importing SparkSession from pyspark.sql and building a session from it.
- A dataframe, df, is created by reading in the CSV file with the options header = 'true' and inferSchema = 'true', so that the first row supplies the column names and Spark infers each column's data type.
- Finally, it is good practice to display the imported data to confirm that it has loaded into the dataframe correctly. The output, showing the first two rows of the San Francisco Fire Department calls dataset, can be seen in the following screenshot:
Please note that as we read the file into Spark, we are using .load() to pull the .csv file ...