June 2017
Beginner to intermediate
576 pages
15h 22m
English
Now we will construct a third query that extracts all records that may be considered outliers. For this example, we define an outlier as any record whose age or pressure is more than 1.5 standard deviations above or below the mean for its outcome class. This is accomplished by joining our detail-level data with the aggregated means for age and pressure:
We can also compute a new column, agediff, which is the difference between age and average age.
We add limit=1000 as a protective filter, so that we never retrieve more than that number of results. Placing limits on SQL queries tends to speed up result processing. In this case, one record is returned:
anomolies <- SparkR::sql("select ...
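The outlier rule described above can also be sketched in base R, without a Spark session, to make the join-and-filter logic concrete. This is only an illustration under stated assumptions: the data values are made up, and the names `detail`, `stats`, `avg_age`, `avg_pressure`, `sd_age`, and `sd_pressure` are hypothetical and do not come from the query in the text. Only `age`, `pressure`, `agediff`, and the 1.5 threshold are taken from the description.

```r
# Toy data: made-up values, NOT the dataset used in the text.
detail <- data.frame(
  outcome  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  age      = c(25, 27, 26, 28, 60, 45, 47, 46, 44, 48),
  pressure = c(70, 72, 71, 69, 70, 80, 82, 81, 79, 80)
)

# Aggregate the mean and standard deviation of age and pressure
# for each outcome class (column names below are hypothetical).
means <- aggregate(cbind(age, pressure) ~ outcome, data = detail, FUN = mean)
sds   <- aggregate(cbind(age, pressure) ~ outcome, data = detail, FUN = sd)
names(means) <- c("outcome", "avg_age", "avg_pressure")
names(sds)   <- c("outcome", "sd_age", "sd_pressure")
stats <- merge(means, sds, by = "outcome")

# Join the detail-level rows with the per-class aggregates.
joined <- merge(detail, stats, by = "outcome")

# New column: difference between each record's age and its class average.
joined$agediff <- joined$age - joined$avg_age

# Flag records more than k standard deviations away from the class mean
# on either age or pressure.
k <- 1.5
is_outlier <-
  abs(joined$age - joined$avg_age) > k * joined$sd_age |
  abs(joined$pressure - joined$avg_pressure) > k * joined$sd_pressure

# head(..., 1000) plays the role of the protective limit in the SQL version.
anomalies <- head(joined[is_outlier, ], 1000)
print(anomalies)
```

With these toy values, only the record with age 60 in outcome class 0 is flagged. In SparkR the same steps would be expressed as a single SQL statement over registered tables, but the structure is the same: aggregate per class, join back to the detail rows, then filter on the distance from the class mean.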