Loading click logs

To train a model on massive click logs, we first need to load the data into Spark. We do so by taking the following steps:

  1. First, we spin up the PySpark shell by using the following command:
./bin/pyspark --master local[*] --driver-memory 20G

Here, we specify a large driver memory as we are dealing with a dataset of more than 6 GB.

  2. Start a Spark session with an application named CTR:
>>> spark = SparkSession\
...     .builder\
...     .appName("CTR")\
...     .getOrCreate()
  3. Then, we load the click log data from the train file into a DataFrame object. Note that the data-loading function spark.read.csv allows a custom schema, which guarantees the data is loaded as expected, as opposed to having the types inferred by default. So first, we define the schema:
>>> ...
