Just to drive home how you can actually use DataFrames instead of RDDs, let's go through an example of actually going to one of our earlier exercises that we did with RDDs and do it with DataFrames instead. This will illustrate how using DataFrames can make things simpler. We go back to the example where we figured out the most popular movies based on the MovieLens DataSet ratings information. If you want to open the file, you'll find it in the download package as popular-movies-dataframe.py, or you can just follow along typing it in as you go. This is what your file should look like if you open it in your IDE:
Let's go through this in detail. First comes our import statements:
from pyspark.sql import SparkSession ...