Summary Statistics
R includes a variety of functions for calculating summary statistics. To calculate the mean of a vector, use the mean function. You can calculate minima with the min function, or maxima with the max function. As an example, let’s use the dow30 data set that we created in An extended example. This data set is also available in the nutshell package:
> library(nutshell)
> data(dow30)
> mean(dow30$Open)
[1] 36.24574
> min(dow30$Open)
[1] 0.99
> max(dow30$Open)
[1] 122.45
For each of these functions, the argument na.rm specifies how NA values are treated. By default, if any value in the vector is NA, then the value NA is returned. Specify na.rm=TRUE to ignore missing values:
> mean(c(1, 2, 3, 4, 5, NA))
[1] NA
> mean(c(1, 2, 3, 4, 5, NA), na.rm=TRUE)
[1] 3
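The same argument works for min and max. The following is a minimal sketch, assuming a made-up vector named prices (it is not part of dow30) that contains one missing value:

```r
# Hypothetical vector with one missing value (not taken from dow30)
prices <- c(12.5, 7.25, NA, 30)

# Default behavior: a single NA makes the result NA
min_default <- min(prices)
max_default <- max(prices)

# With na.rm=TRUE the missing value is dropped before the calculation
min_clean  <- min(prices, na.rm=TRUE)
max_clean  <- max(prices, na.rm=TRUE)
mean_clean <- mean(prices, na.rm=TRUE)

print(c(min_default, max_default))
print(c(min_clean, max_clean, mean_clean))
```

Note that na.rm=TRUE silently discards the missing observations, so it is worth checking how many values are NA (for example, with sum(is.na(prices))) before relying on the result.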
Optionally, you can reduce the influence of outliers when using the mean function. To do this, use the trim argument to specify the fraction of observations (between 0 and 0.5) to drop from each end of the sorted vector before the mean is computed:
> mean(c(-1, 0:100, 2000))
[1] 68.4369
> mean(c(-1, 0:100, 2000), trim=0.1)
[1] 50
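To see what trim does, it can help to reproduce the trimmed mean by hand. The sketch below reuses the vector from the example above. The variable names (k, kept, and so on) are illustrative, and the manual steps are meant to mirror the documented behavior of mean rather than its actual source code:

```r
x <- c(-1, 0:100, 2000)
trim <- 0.1

n <- length(x)
k <- floor(n * trim)              # number of observations dropped from each end
sorted <- sort(x)
kept <- sorted[(k + 1):(n - k)]   # keep only the middle of the sorted vector

manual_trimmed  <- mean(kept)
builtin_trimmed <- mean(x, trim=trim)

print(c(manual_trimmed, builtin_trimmed))   # both are 50
```

Because the same number of observations is dropped from both ends, the extreme values -1 and 2000 are removed along with the next nine smallest and nine largest values.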
To calculate the minimum and maximum at the same time, use the range function. This returns a vector containing the minimum and maximum values:
> range(dow30$Open)
[1] 0.99 122.45
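Because range returns an ordinary two-element vector, you can index it or pass it to other functions. Here is a short sketch, assuming a made-up vector named temps (not part of dow30):

```r
# Hypothetical measurements with one missing value
temps <- c(18.5, 21, NA, 15.25, 23.75)

r <- range(temps, na.rm=TRUE)   # c(minimum, maximum), ignoring the NA
lowest  <- r[1]
highest <- r[2]
spread  <- diff(r)              # maximum minus minimum

print(r)
print(spread)
```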
Another useful function is quantile. This function returns the values at different percentiles (specified by the probs argument):
> quantile(dow30$Open, probs=c(0, 0.25, 0.5, 0.75, 1.0))
0% 25% 50% 75% 100%
0.990 19.655 30.155 51.680 122.450
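The result of quantile is a named numeric vector, so individual percentiles can be extracted by name. The sketch below uses a small made-up vector (not dow30) with the same probs as above. It checks that the 0%, 50%, and 100% values agree with min, median, and max:

```r
# Hypothetical data, already sorted for readability
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

q <- quantile(x, probs=c(0, 0.25, 0.5, 0.75, 1.0))

# Elements are named "0%", "25%", "50%", "75%", "100%"
lowest  <- unname(q["0%"])
middle  <- unname(q["50%"])
highest <- unname(q["100%"])

print(q)
print(c(lowest == min(x), isTRUE(all.equal(middle, median(x))), highest == max(x)))
```

Values between observations are interpolated, and the exact result depends on the type argument of quantile. This sketch relies on the default.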
You can return this specific set of values (minimum, 25th percentile, ...