Chapter 8. Data Manipulation and Visualization in R

American statistician Ronald Thisted once quipped: “Raw data, like raw potatoes, usually require cleaning before use.” Data manipulation takes time, and you’ve felt the pain if you’ve ever had to do any of the following:

  • Select, drop, or create calculated columns

  • Sort or filter rows

  • Group by and summarize categories

  • Join multiple datasets by a common field

Chances are, you’ve done all of these in Excel…a lot, and you’ve probably dug into celebrated features like VLOOKUP() and PivotTables to accomplish them. In this chapter, you’ll learn the R equivalents of these techniques, particularly with the help of dplyr.
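As a rough preview of how the four tasks listed above map onto dplyr verbs, here is a minimal sketch. The `sales` and `regions` tables, their column names, and their values are made up purely for illustration and are not part of the book's datasets.

```r
library(tidyverse)

# Hypothetical toy tables (made-up values, for illustration only)
sales <- tibble(
  region = c("East", "East", "West", "West", "West"),
  rep_id = c(1, 2, 3, 4, 5),
  units  = c(10, 4, 7, 12, 3),
  price  = c(2.5, 2.5, 3.0, 3.0, 3.0)
)

regions <- tibble(
  region  = c("East", "West"),
  manager = c("Avery", "Blake")
)

# Drop a column, then create a calculated column
sales_rev <- sales %>%
  select(-rep_id) %>%
  mutate(revenue = units * price)

# Filter rows, then sort them from highest to lowest revenue
big_sales <- sales_rev %>%
  filter(units >= 5) %>%
  arrange(desc(revenue))

# Group by a category and summarize it (roughly what a PivotTable does)
by_region <- sales_rev %>%
  group_by(region) %>%
  summarize(total_revenue = sum(revenue), n_sales = n())

# Join two tables on a common field (roughly what VLOOKUP() does)
by_region_named <- left_join(by_region, regions, by = "region")
```

Each step takes a data frame and returns a new one, so the verbs chain together with the `%>%` pipe instead of being scattered across formulas and menu clicks.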

Data manipulation often goes hand in hand with visualization: as mentioned, humans are remarkably adept at visually processing information, so plotting is a great way to size up a dataset. You’ll learn how to visualize data using the gorgeous ggplot2 package, which, like dplyr, is part of the tidyverse. This will put you on solid footing to explore and test relationships in data using R, which will be covered in Chapter 9.

Let’s get started by calling in the relevant packages. We’ll also be using the star dataset from the book’s companion repository in this chapter, so we can import it now:

library(tidyverse)
library(readxl)

star <- read_excel('datasets/star/star.xlsx')
head(star)
#> # A tibble: 6 x 8
#>   tmathssk treadssk classk            totexpk sex   freelunk race  schidkn
#>      <dbl>    <dbl> <chr>               <dbl> <chr> <chr>    <chr>   <dbl>
#> 1 473 447 small.class ...
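To give a first sense of what sizing up a dataset visually can look like, here is a minimal ggplot2 sketch. It assumes a toy stand-in called `star_toy` that borrows the `tmathssk` column name from the output above; the scores themselves are made up so that the example runs without the Excel file.

```r
library(tidyverse)

# Hypothetical stand-in for the star data: the column name matches the
# output above, but these values are invented for illustration
star_toy <- tibble(
  tmathssk = c(473, 536, 463, 559, 489, 454)
)

# Map the math score to the x-axis and draw a histogram of its distribution
p <- ggplot(data = star_toy, aes(x = tmathssk)) +
  geom_histogram(bins = 5)

print(p)
```

Swapping `star_toy` for the imported `star` data frame should produce the same kind of plot on the real scores.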
