Practical Predictive Analytics by Ralph Winters

Using SQL to examine potential outliers

Based upon the preceding sample_bin counts, we might decide to extract a sample made up of some negative and some positive cases. We know that positive sample_bin values represent an outcome of 1, and negative sample_bin values represent an outcome of 0. We can also pick a cutoff value that will accommodate whatever sample size we would like. We will be looking to extract a 10,000-row sample, so we will set the bounds to +10 and -10.

bin_extract <- SparkR::sql("SELECT * from out_tbl where sample_bin >= -10 AND sample_bin <= 10")
nrow(bin_extract) # nrow should be 10,000 in the output
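To make the link between the cutoff and the sample size concrete, here is a minimal sketch in base R. It assumes a hypothetical table of per-bin row counts: the bin_counts data frame, its 500-rows-per-bin figure, and the pick_cutoff helper are all illustrative and are not taken from the book's data or code. The idea is simply to widen a symmetric range of bins until it holds at least the target number of rows.

```r
# Hypothetical bin counts: toy numbers, NOT the book's data.
# Bins 1..20 stand for outcome = 1 rows, bins -20..-1 for outcome = 0 rows.
bin_counts <- data.frame(
  sample_bin = c(-20:-1, 1:20),
  n          = rep(500L, 40)
)

# Hypothetical helper: smallest symmetric cutoff k such that the bins in
# [-k, k] together hold at least `target` rows.
pick_cutoff <- function(counts, target) {
  ks <- sort(unique(abs(counts$sample_bin)))
  for (k in ks) {
    total <- sum(counts$n[abs(counts$sample_bin) <= k])
    if (total >= target) return(k)
  }
  NA_integer_  # the target exceeds the number of rows available
}

cutoff <- pick_cutoff(bin_counts, target = 10000)
cutoff
```

With real data, the counts would come from the sample_bin tabulation rather than from a hand-built data frame, and uneven bin sizes would generally give a sample close to, but not exactly, the target size.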

Next, we will register bin_extract as a temporary table so that we can run SQL queries against it:

 SparkR:::registerTempTable(bin_extract,"bin_extract")  ...
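Once a table has been registered, it can be referenced by name inside SparkR::sql(). The following is only a sketch of that pattern, assuming an active Spark session and the registration step above. The query itself is an illustration built on the sample_bin column, not the code elided from the text, and bin_summary is a hypothetical name.

```r
# Assumes a running Spark session and that bin_extract has been registered
# as a temporary table. Illustrative query only, not the book's elided code.
bin_summary <- SparkR::sql("
  SELECT sample_bin, COUNT(*) AS n_rows
  FROM bin_extract
  GROUP BY sample_bin
  ORDER BY sample_bin
")

# Bring the first rows of the per-bin counts back to the R session.
SparkR::head(bin_summary, 25)
```

A per-bin count of this kind is a quick check that the extract covers only the bins between the chosen bounds before looking at individual columns for outliers.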
