Practical Predictive Analytics by Ralph Winters

Picking out some potential outliers using a third query

Now we will construct a third query that extracts all records that may be considered outliers. For this example, we define an outlier as any record whose age or pressure lies more than 1.5 standard deviations above or below the mean for its outcome class. This is accomplished by joining our detail-level data with the aggregated means for age and pressure:

  • We can also compute a new column, agediff, which is the difference between age and average age.

  • We add limit=1000 as a protective filter, so that we never retrieve more than that number of results. Placing limits on SQL queries tends to speed up result processing. In this case, one record is returned:

 anomolies <- SparkR::sql("select ...
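
The query above is truncated in this excerpt, so its exact text is not reproduced here. As a rough illustration of the same idea, the following sketch uses base R rather than SparkR SQL. The toy data, the column names (outcome, age, pressure), and the result name outliers are assumptions made for this example and are not taken from the book's dataset or query.

```r
# Toy data: values and column names are invented for illustration only.
detail <- data.frame(
  outcome  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  age      = c(25, 27, 26, 24, 60, 45, 47, 46, 44, 48),
  pressure = c(70, 72, 71, 69, 70, 80, 82, 81, 79, 83)
)

# Aggregate step: per-class mean and standard deviation of age and pressure.
avgs <- aggregate(cbind(age, pressure) ~ outcome, data = detail, FUN = mean)
names(avgs) <- c("outcome", "avg_age", "avg_pressure")
sds <- aggregate(cbind(age, pressure) ~ outcome, data = detail, FUN = sd)
names(sds) <- c("outcome", "sd_age", "sd_pressure")
stats <- merge(avgs, sds, by = "outcome")

# Join step: attach the class-level aggregates to every detail row.
joined <- merge(detail, stats, by = "outcome")

# agediff: difference between age and the average age of the outcome class.
joined$agediff <- joined$age - joined$avg_age

# Flag rows more than k standard deviations from the class mean
# on either age or pressure.
k <- 1.5
is_outlier <-
  abs(joined$agediff) > k * joined$sd_age |
  abs(joined$pressure - joined$avg_pressure) > k * joined$sd_pressure

# head() plays the role of the protective limit of 1000 rows.
outliers <- head(joined[is_outlier, ], 1000)
print(outliers)
```

With this toy data, only the row with age 60 in outcome class 0 is flagged. In SparkR the same join and filter would be expressed inside the SQL string passed to SparkR::sql, as in the book's query.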
