Data Science Fundamentals with R, Python, and Open Data

by Marco Cremonini
April 2024
Content level: Beginner to intermediate
480 pages
12h 22m
English
Wiley
Content preview from Data Science Fundamentals with R, Python, and Open Data

7 Groups and Operations on Groups

Up to now, we have seen data wrangling operations on a full data frame or on subsets of rows and columns. Often, we also need to calculate statistics on groups of rows sharing common properties. For instance, based on population data, we may want to obtain statistics by gender, age, place of residence, education level, and so forth. This means that we need a way, first, to identify the rows with a common feature (e.g. same gender, same education level), and then to compute statistics for each subset of observations.

With the knowledge we have gained so far, we already know how to achieve this kind of result, but the approach is highly inefficient. To obtain subsets of rows sharing common properties, we could use logical conditions and filtering functions, one for each group of rows we are interested in, and create a new data frame for each group. Then, for each data frame, we could calculate the desired statistics, such as the mean, maximum, or minimum. This solution works, but at what cost? Well… it depends. If we have a dataset with US census data and we just want general statistics for women and men, that requires obtaining two data frames and calculating the statistics for each one. It may not look like a tremendous effort or a waste of time. But what if we want those statistics at the regional level? In the US Census data, at level 3, there are nine regions (“New England,” “Middle Atlantic,” “East North Central,” “West North Central,” “South Atlantic,” “East ...
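
The manual approach described above can be sketched in Python with pandas. This is only an illustration under stated assumptions: the data frame `people`, its columns (`gender`, `region`, `age`), and all values are invented here and do not come from the US Census data or from the book.

```python
import pandas as pd

# Toy data invented for illustration; not taken from any census dataset.
people = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "region": ["New England", "New England", "Middle Atlantic",
               "Middle Atlantic", "New England"],
    "age": [34, 41, 29, 52, 47],
})

# Manual approach: one logical condition and one new data frame per group.
women = people[people["gender"] == "F"]
men = people[people["gender"] == "M"]

# Then one statistic computed on each of the resulting data frames.
manual_means = {
    "F": women["age"].mean(),
    "M": men["age"].mean(),
}

# Adding the region multiplies the work: one filter per (gender, region) pair.
combos = people[["gender", "region"]].drop_duplicates()
manual_by_region = {}
for gender, region in combos.itertuples(index=False):
    subset = people[(people["gender"] == gender) & (people["region"] == region)]
    manual_by_region[(gender, region)] = subset["age"].mean()

print(manual_means)
print(manual_by_region)
```

With two genders and nine regions, the same pattern would need up to 18 separate filters and subsets, which is where the cost of this approach starts to show.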


Publisher Resources

ISBN: 9781394213245