First, we will analyze the characteristics of MovieLens users.
We use a custom schema, custom_schema, to load the pipe-delimited (|) data into a DataFrame. This Python code is in com/sparksamples/Util.py:
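from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def get_user_data():
    # One StructField per |-separated column of u.user; all columns are nullable.
    custom_schema = StructType([
        StructField("no", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("gender", StringType(), True),
        StructField("occupation", StringType(), True),
        StructField("zipCode", StringType(), True)
    ])
    # sc (the SparkContext) and PATH (the data directory) are defined elsewhere in Util.py.
    sql_context = SQLContext(sc)
    # Parentheses let the method chain span several lines.
    user_df = (sql_context.read
               .format('com.databricks.spark.csv')
               .options(header='false', delimiter='|')
               .load("%s/ml-100k/u.user" % PATH, schema=custom_schema))
    return user_df
...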
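To make the mapping between the schema and the file concrete, the following is a minimal sketch in plain Python (no Spark) of how a |-delimited record lines up with the five schema fields. The names USER_COLUMNS and parse_user_lines are hypothetical helpers, not part of Util.py, and the two rows in SAMPLE are illustrative records in the u.user layout rather than output from the real load.

```python
import csv
import io

# Hypothetical mirror of custom_schema: field name plus the Python type
# that corresponds to StringType / IntegerType.
USER_COLUMNS = [
    ("no", str),
    ("age", int),
    ("gender", str),
    ("occupation", str),
    ("zipCode", str),
]

def parse_user_lines(text):
    """Parse |-delimited user records into a list of dicts keyed by field name."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        rows.append({name: cast(value)
                     for (name, cast), value in zip(USER_COLUMNS, fields)})
    return rows

# Illustrative rows in the u.user layout: no|age|gender|occupation|zipCode
SAMPLE = "1|24|M|technician|85711\n2|53|F|other|94043\n"

users = parse_user_lines(SAMPLE)
print(users[0])
```

Only age is converted to an integer. The zip code stays a string, as in the schema, which is the safer choice when a postal code column may not be purely numeric.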