Practical Predictive Analytics by Ralph Winters

Indexing the classification features

Indexing is used to optimize data access and supply the parameters to specific machine learning algorithms in an acceptable format.

We will incorporate the race variable into the decision tree model, so the first step is to determine which values race takes. We do this by again using SQL to count the frequency of each value. Note that we can write either "GROUP BY race" or "GROUP BY 1"; the latter is shorthand for the first column in the SELECT list (here, race):

%python
dfx = spark.sql("SELECT race, count(*) FROM stopfrisk GROUP BY 1")
dfx.show()

The output shows eight distinct values: Q, B, U, Z, A, W, I, and P.
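The equivalence of grouping by column name and by column position can be checked on a small stand-in table. The following is a minimal sketch only: it uses Python's built-in sqlite3 module rather than Spark SQL so that it runs without a cluster, and the handful of rows are invented for illustration (they are not the stop-and-frisk data).

```python
import sqlite3

# Toy stand-in for the stopfrisk table; these rows are invented for illustration.
rows = [("B",), ("W",), ("B",), ("Q",), ("B",), ("W",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stopfrisk (race TEXT)")
conn.executemany("INSERT INTO stopfrisk (race) VALUES (?)", rows)

# Group by the column name.
by_name = conn.execute(
    "SELECT race, count(*) FROM stopfrisk GROUP BY race ORDER BY race"
).fetchall()

# Group by the position of the column in the SELECT list.
by_ordinal = conn.execute(
    "SELECT race, count(*) FROM stopfrisk GROUP BY 1 ORDER BY 1"
).fetchall()
conn.close()

print(by_name)
print(by_ordinal)
```

Both queries return the same frequency table. The positional form is shorter to type, but it silently changes meaning if the SELECT list is reordered, so the named form is usually the safer choice in code that will be maintained.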

Next, use indexer.fit(df2) transform. This will map a ...
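To make the idea of indexing a categorical column concrete, here is a minimal pure-Python sketch. It is not the Spark implementation. It assumes that indexer is a frequency-ordered string indexer, which is the default behavior of StringIndexer in pyspark.ml.feature: each distinct label is assigned a numeric index, with the most frequent label receiving 0.0. The function names fit_string_index and transform, the alphabetical tie-break, and the sample labels are all hypothetical choices made for this illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically
    # (the tie-break rule is an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(values, mapping):
    """Replace each label with its learned numeric index."""
    return [mapping[v] for v in values]


# Invented sample labels, not the real stop-and-frisk data.
race = ["B", "W", "B", "Q", "B", "W"]

mapping = fit_string_index(race)
indexed = transform(race, mapping)

print(mapping)
print(indexed)
```

Splitting the work into a fit step (learn the mapping from the data) and a transform step (apply it) mirrors the fit/transform pattern named in the text, and it lets the same mapping be reused on new data.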