7 Groups and Operations on Groups

Up to now, we have seen data wrangling operations on a full data frame or on subsets of rows and columns. Often, we also need to calculate statistics on groups of rows sharing common properties. For instance, based on population data, we may want to obtain statistics by gender, age, place of residence, education level, and so forth. This means that we need a way, first, to identify the rows with a common feature (e.g., the same gender or the same education level), and then to compute statistics for each subset of observations.

With the knowledge we have gained so far, we already know how to achieve this kind of result, but the approach is highly inefficient. To obtain subsets of rows sharing common properties, we could use logical conditions and filtering functions, one for each group of rows we are interested in, and create a new data frame for each group. Then, for each data frame, we could calculate the desired statistics, such as the mean, maximum, and minimum. This solution works, but at what cost? Well… it depends. If we have a dataset with US census data and we just want general statistics for women and men, we need to obtain two data frames and calculate the statistics for each one. That may not look like a tremendous effort or a waste of time. But what if we want those statistics at the regional level? In the US Census data, at level 3, there are nine regions (“New England,” “Middle Atlantic,” “East North Central,” “West North Central,” “South Atlantic,” “East ...
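The filter-per-group approach described above can be sketched in a few lines. This is only an illustration, assuming Python with pandas; the data frame `people`, its columns `gender` and `age`, and all the values are invented for the example and are not census data. The last step uses pandas's `groupby` purely for comparison, and the chapter may introduce grouping differently.

```python
import pandas as pd

# Toy data, invented for illustration (hypothetical names and values,
# not real census figures).
people = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "age": [34, 45, 29, 52, 42],
})

# Manual approach: one logical condition and one new data frame per group.
women = people[people["gender"] == "F"]
men = people[people["gender"] == "M"]

# Then compute the desired statistic separately on each data frame.
manual_means = {
    "F": women["age"].mean(),
    "M": men["age"].mean(),
}
print(manual_means)

# For comparison only: the same statistic from a single grouped computation.
grouped_means = people.groupby("gender")["age"].mean()
print(grouped_means)
```

With two groups the manual version is tolerable, but every additional group (nine regions, for instance) needs its own filter and its own data frame, whereas the grouped call stays the same length.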

