To generate the test data, we select only the rows that fall outside the cutoff points, which yields the 20% sample. Note that SparkR's describe() function returns the same kind of summary statistics as summary() (in older SparkR releases the two are aliases of each other). Use describe() when you want to avoid a name conflict between the Spark and base R summary() functions.
#set the test data set to include the rest of the population
set.seed(123)
test <- filter(out_sd, out_sd$sample_bin <= Cutoff_low | out_sd$sample_bin >= Cutoff_high)
test_sumdf <- describe(test)
display(select(test_sumdf, "summary", "outcome", "pregnant", "age", "mass", "glucose", "triceps"))
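To see why a two-sided cutoff gives roughly a 20% test set, here is a minimal sketch in base R rather than SparkR, so it can be run without a Spark session. The data frame out_df, the bin range 1 to 100, and the cutoff values 10 and 91 are assumptions made for illustration only; they are not taken from the example above.

```r
# Base R sketch (not SparkR): hypothetical data and cutoffs for illustration.
set.seed(123)
n <- 1000
out_df <- data.frame(
  id = seq_len(n),
  sample_bin = sample(1:100, n, replace = TRUE)  # assumed: bins 1..100, uniform
)

# Assumed cutoffs: bins 11..90 would be training data (about 80% of rows).
cutoff_low <- 10
cutoff_high <- 91

# Test rows are those at or beyond either cutoff, mirroring the filter above.
is_test <- out_df$sample_bin <= cutoff_low | out_df$sample_bin >= cutoff_high
test_df <- out_df[is_test, ]
train_df <- out_df[!is_test, ]

# Every row lands in exactly one of the two sets.
stopifnot(nrow(test_df) + nrow(train_df) == nrow(out_df))

# Share of rows in the test set; expected to be close to 0.20.
test_share <- nrow(test_df) / nrow(out_df)
print(round(test_share, 3))
```

Because 20 of the 100 assumed bins lie at or beyond the cutoffs, the test share should come out near 20%, with some variation from the random bin assignment.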