Based on the preceding sample_bin counts, we might decide to extract a sample that contains both negative and positive cases. We know that positive sample_bin values represent an outcome of 1, and negative sample_bin values represent an outcome of 0. We can also pick a cutoff value that accommodates whatever sample size we would like. We are looking to extract a 10,000-row sample, so we will set the bounds to +10 and -10.
bin_extract <- SparkR::sql("SELECT * FROM out_tbl WHERE sample_bin >= -10 AND sample_bin <= 10")
nrow(bin_extract) # nrow should be 10,000 in the output
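To see how the cutoff controls the sample size without starting a Spark session, the same filter can be sketched in base R on a small local data frame. Everything here is illustrative: out_tbl_local, its 50 rows per bin, the outcome column, and the absence of a bin numbered 0 are assumptions made for the example, not properties of the real out_tbl.

```r
# Hypothetical local stand-in for out_tbl (an assumption for illustration):
# bins run from -20 to 20 with no bin 0, and each bin holds 50 rows.
bins <- setdiff(-20:20, 0)
out_tbl_local <- data.frame(sample_bin = rep(bins, each = 50))

# Positive bins carry outcome 1, negative bins carry outcome 0.
out_tbl_local$outcome <- as.integer(out_tbl_local$sample_bin > 0)

# Symmetric cutoff, mirroring the bounds used in the SQL query.
cutoff <- 10
bin_extract_local <- subset(
  out_tbl_local,
  sample_bin >= -cutoff & sample_bin <= cutoff
)

nrow(bin_extract_local)           # 20 bins x 50 rows = 1000
table(bin_extract_local$outcome)  # 500 rows for each outcome
```

Because the bounds are symmetric and the toy bins are equally sized, the extract is balanced between the two outcomes. Widening or narrowing cutoff changes the row count in steps of one bin on each side, which is how a cutoff can be chosen to hit a target sample size.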
Next, we will register bin_extract so that we can run some SQL against it.