The following section explains the techniques used in, and the insights gained from, exploratory data analysis.
- The date column in the dataframe is really a date-time column whose time component is always 00:00:00. The time portion carries no information that we need for modeling, so it can be removed from the dataset. PySpark provides a to_date function that does this easily. The dataframe, df, is transformed using the withColumn() function so that the date column now holds only the date, without the timestamp, as seen in the following screenshot:
- For analysis purposes, we want to extract the day, month, and year from ...