Practical Predictive Analytics by Ralph Winters

First collect the sample

To illustrate this, let's first take a 50% sample of the Stop and Frisk dataframe. We also want to make sure that the amount of data we extract can be processed easily by base R, which holds its data in memory and is therefore limited by the RAM available on the local machine.

  • The code below first extracts a 50% sample from Spark and stores it in a local R dataframe named dflocal.
  • It then runs str() to verify the row count and the metadata:
dflocal <- collect(sample(df, FALSE, 0.50, 123))  # 50% sample, no replacement, seed 123
str(dflocal)                                      # row count and column metadata
The output indicates that there are 11,311 rows, which is roughly 50% of the 22,563 rows from the Stop and Frisk data.
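The count is close to, but not exactly, half of the rows because Spark's sample() treats the fraction as a per-row sampling probability rather than an exact quota. The following base R sketch imitates that behavior on a stand-in dataframe and then checks how much memory the local copy uses. It does not call Spark: df_standin, its columns, and dflocal_sim are hypothetical names invented for illustration, and the random-number approach is an assumption about how such a sampler behaves, not Spark's actual implementation.

```r
# Hypothetical stand-in for the Spark dataframe: a local dataframe with the
# same number of rows as the Stop and Frisk data (22,563). Columns are invented.
set.seed(123)
n_rows <- 22563
df_standin <- data.frame(id = seq_len(n_rows), value = rnorm(n_rows))

# Sampling without replacement, one decision per row: keep each row
# independently with probability 0.50, so the sample size varies around
# n_rows / 2 instead of hitting it exactly.
fraction <- 0.50
keep <- runif(n_rows) < fraction
dflocal_sim <- df_standin[keep, ]

# Compare the realized fraction with the requested one.
realized <- nrow(dflocal_sim) / n_rows
print(realized)

# Check how much memory the local copy occupies before working with it.
print(format(object.size(dflocal_sim), units = "MB"))
```

Running the same checks (nrow() and object.size()) on the real dflocal after collect() is a quick way to confirm that the extract fits comfortably in local memory.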
