To train a model on massive click logs, we first need to load the data into Spark. We do so by taking the following steps:
- First, we spin up the PySpark shell by using the following command:
./bin/pyspark --master local[*] --driver-memory 20G
Here, we specify a large driver memory because we are dealing with a dataset of more than 6 GB.
- Start a Spark session with an application named CTR:
>>> spark = SparkSession \
...     .builder \
...     .appName("CTR") \
...     .getOrCreate()
- Then, we load the click log data from the train file into a DataFrame object. Note that the data loading function, spark.read.csv, accepts a custom schema, which guarantees the data is loaded with the expected column types, as opposed to relying on schema inference by default. So first, we define the schema: