Practical Predictive Analytics by Ralph Winters

Reading the Stop and Frisk table

The first code chunk reads the StopFrisk table into a dataframe using spark.sql, just as was done earlier in this section. Observe that the SQL syntax in Python is very similar to what we have been using with R.

Within the SQL call, the outcome variable frisked is mapped to a binary variable using a CASE statement. The reason for doing this is that the MLlib algorithms work with numeric data rather than character data. Character data usually needs to be mapped to an integer, or wrapped in a labeled point (LabeledPoint), before it can be used.

The resulting dataframe (df2) is then displayed using the show(5) method, which is the Python equivalent of the R head(df2,5) function:

%python
from pyspark.mllib.tree import DecisionTree, ...
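
The CASE mapping described above can be tried outside of Spark. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for Spark, and the table name stopfrisk, the sample rows, and the 'Y'/'N' coding of frisked are assumptions made for illustration, since the excerpt does not show them. In a Spark session, a query with the same CASE expression would be passed to spark.sql instead.

```python
import sqlite3

# Hypothetical stand-in data: the real StopFrisk table lives in Spark, and
# only the existence of a `frisked` column is taken from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stopfrisk (frisked TEXT)")
conn.executemany(
    "INSERT INTO stopfrisk (frisked) VALUES (?)",
    [("Y",), ("N",), ("Y",), ("N",), ("N",)],
)

# The CASE expression maps the character outcome to a 0/1 integer label.
# The 'Y'/'N' coding is an assumption, not taken from the book.
query = """
    SELECT CASE WHEN frisked = 'Y' THEN 1 ELSE 0 END AS frisked
    FROM stopfrisk
    ORDER BY rowid
"""

labels = [row[0] for row in conn.execute(query)]
print(labels)  # [1, 0, 1, 0, 0]
```

The result is a column of integers, which is the form a classifier label is expected to take.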
