Exploring the user dataset

First, we will analyze the characteristics of MovieLens users.

We use a custom schema (custom_schema) to load the pipe-delimited (|) data into a DataFrame. This Python code is in com/sparksamples/Util.py:

    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    def get_user_data():
        # Schema for the pipe-delimited u.user file: one field per column
        custom_schema = StructType([
            StructField("no", StringType(), True),
            StructField("age", IntegerType(), True),
            StructField("gender", StringType(), True),
            StructField("occupation", StringType(), True),
            StructField("zipCode", StringType(), True)
        ])
        # sc (the SparkContext) and PATH are assumed to be defined elsewhere
        sql_context = SQLContext(sc)
        user_df = sql_context.read \
            .format('com.databricks.spark.csv') \
            .options(header='false', delimiter='|') \
            .load("%s/ml-100k/u.user" % PATH, schema=custom_schema)
        return user_df
    ...
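To make the role of the schema concrete without a running Spark cluster, here is a minimal sketch that uses only the Python standard library. It is not the book's code: SCHEMA, SAMPLE, and parse_users are hypothetical names introduced for illustration, and the sample rows are illustrative records in the u.user layout rather than data read from the file. The sketch splits each line on | and casts each value to the type that custom_schema declares for its column.

```python
import csv
import io

# Hypothetical mirror of custom_schema: column name and Python type per field.
SCHEMA = [
    ("no", str),
    ("age", int),
    ("gender", str),
    ("occupation", str),
    ("zipCode", str),
]

# Illustrative rows in the u.user layout (assumption: not read from the real file).
SAMPLE = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"


def parse_users(text):
    """Parse pipe-delimited text into a list of dicts typed according to SCHEMA."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    users = []
    for row in reader:
        # Pair each raw string with its column name and cast function.
        users.append({name: cast(value) for (name, cast), value in zip(SCHEMA, row)})
    return users


users = parse_users(SAMPLE)
print(users[0])
```

Keeping zipCode as a string, as the schema does, preserves leading zeros and non-numeric codes that an integer column would lose or reject.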
