The first code chunk reads the StopFrisk dataframe using spark.sql, much as was done earlier in this section. Observe that the SQL syntax in Python is very similar to what we have been using with R.
Within the SQL call, the outcome variable frisked is mapped to a binary variable using a CASE statement. This is done because the MLlib algorithms handle integer data much better than character data; character data typically must first be mapped to an integer or encoded as a labeled point.
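The CASE mapping itself is plain SQL, so its behavior can be sketched without a Spark cluster using Python's built-in sqlite3 module. The table name, column names, and "Y"/"N" values below are hypothetical stand-ins for the StopFrisk data, chosen only to illustrate how a character outcome becomes a binary integer label:

```python
import sqlite3

# Build a tiny in-memory table with an assumed character-valued
# outcome column; the real StopFrisk schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stopfrisk (id INTEGER, frisked TEXT)")
conn.executemany(
    "INSERT INTO stopfrisk VALUES (?, ?)",
    [(1, "Y"), (2, "N"), (3, "Y")],
)

# The CASE statement converts the character outcome to a 0/1 integer,
# the numeric label form that tree algorithms expect.
rows = conn.execute("""
    SELECT id,
           CASE WHEN frisked = 'Y' THEN 1 ELSE 0 END AS frisked_bin
    FROM stopfrisk
""").fetchall()
print(rows)  # [(1, 1), (2, 0), (3, 1)]
```

The same SELECT ... CASE ... END AS ... pattern can be passed to spark.sql, which returns the result as a Spark dataframe rather than a list of tuples.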
The resulting dataframe (df2) is then displayed using the show(5) method, the Python equivalent of R's head(df2, 5):
%python
from pyspark.mllib.tree import DecisionTree, ...