R for Data Science, 2nd Edition

Book description

Use R to turn data into insight, knowledge, and understanding. With this practical book, aspiring data scientists will learn how to do data science with R and RStudio, along with the tidyverse—a collection of R packages designed to work together to make data science fast, fluent, and fun. Even if you have no programming experience, this updated edition will have you doing data science quickly.

You'll learn how to import, transform, and visualize your data and communicate the results. You'll also get a complete, big-picture understanding of the data science cycle and the basic tools you need to manage the details. This edition is updated for the latest tidyverse features and best practices, and its new chapters show you how to get data from spreadsheets, databases, and websites. Exercises help you practice what you've learned along the way.

You'll understand how to:

  • Visualize: Create plots for data exploration and communication of results
  • Transform: Discover variable types and the tools to work with them
  • Import: Get data into R and into a form convenient for analysis
  • Program: Learn R tools for solving data problems with greater clarity and ease
  • Communicate: Integrate prose, code, and results with Quarto

Table of contents

  1. Introduction
    1. Preface to the Second Edition
    2. What You Will Learn
    3. How This Book Is Organized
    4. What You Won’t Learn
      1. Modeling
      2. Big Data
      3. Python, Julia, and Friends
    5. Prerequisites
      1. R
      2. RStudio
      3. The Tidyverse
      4. Other Packages
    6. Running R Code
    7. Other Conventions Used in This Book
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
    11. Online Edition
  2. I. Whole Game
  3. 1. Data Visualization
    1. Introduction
      1. Prerequisites
    2. First Steps
      1. The penguins Data Frame
      2. Ultimate Goal
      3. Creating a ggplot
      4. Adding Aesthetics and Layers
      5. Exercises
    3. ggplot2 Calls
    4. Visualizing Distributions
      1. A Categorical Variable
      2. A Numerical Variable
      3. Exercises
    5. Visualizing Relationships
      1. A Numerical and a Categorical Variable
      2. Two Categorical Variables
      3. Two Numerical Variables
      4. Three or More Variables
      5. Exercises
    6. Saving Your Plots
      1. Exercises
    7. Common Problems
    8. Summary
  4. 2. Workflow: Basics
    1. Coding Basics
    2. Comments
    3. What’s in a Name?
    4. Calling Functions
    5. Exercises
    6. Summary
  5. 3. Data Transformation
    1. Introduction
      1. Prerequisites
      2. nycflights13
      3. dplyr Basics
    2. Rows
      1. filter()
      2. Common Mistakes
      3. arrange()
      4. distinct()
      5. Exercises
    3. Columns
      1. mutate()
      2. select()
      3. rename()
      4. relocate()
      5. Exercises
    4. The Pipe
    5. Groups
      1. group_by()
      2. summarize()
      3. The slice_ Functions
      4. Grouping by Multiple Variables
      5. Ungrouping
      6. .by
      7. Exercises
    6. Case Study: Aggregates and Sample Size
    7. Summary
  6. 4. Workflow: Code Style
    1. Names
    2. Spaces
    3. Pipes
    4. ggplot2
    5. Sectioning Comments
    6. Exercises
    7. Summary
  7. 5. Data Tidying
    1. Introduction
      1. Prerequisites
    2. Tidy Data
      1. Exercises
    3. Lengthening Data
      1. Data in Column Names
      2. How Does Pivoting Work?
      3. Many Variables in Column Names
      4. Data and Variable Names in the Column Headers
    4. Widening Data
      1. How Does pivot_wider() Work?
    5. Summary
  8. 6. Workflow: Scripts and Projects
    1. Scripts
      1. Running Code
      2. RStudio Diagnostics
      3. Saving and Naming
    2. Projects
      1. What Is the Source of Truth?
      2. Where Does Your Analysis Live?
      3. RStudio Projects
      4. Relative and Absolute Paths
    3. Exercises
    4. Summary
  9. 7. Data Import
    1. Introduction
      1. Prerequisites
    2. Reading Data from a File
      1. Practical Advice
      2. Other Arguments
      3. Other File Types
      4. Exercises
    3. Controlling Column Types
      1. Guessing Types
      2. Missing Values, Column Types, and Problems
      3. Column Types
    4. Reading Data from Multiple Files
    5. Writing to a File
    6. Data Entry
    7. Summary
  10. 8. Workflow: Getting Help
    1. Google Is Your Friend
    2. Making a reprex
    3. Investing in Yourself
    4. Summary
  11. II. Visualize
  12. 9. Layers
    1. Introduction
      1. Prerequisites
    2. Aesthetic Mappings
      1. Exercises
    3. Geometric Objects
      1. Exercises
    4. Facets
      1. Exercises
    5. Statistical Transformations
      1. Exercises
    6. Position Adjustments
      1. Exercises
    7. Coordinate Systems
      1. Exercises
    8. The Layered Grammar of Graphics
    9. Summary
  13. 10. Exploratory Data Analysis
    1. Introduction
      1. Prerequisites
    2. Questions
    3. Variation
      1. Typical Values
      2. Unusual Values
      3. Exercises
    4. Unusual Values
      1. Exercises
    5. Covariation
      1. A Categorical and a Numerical Variable
      2. Two Categorical Variables
      3. Two Numerical Variables
    6. Patterns and Models
    7. Summary
  14. 11. Communication
    1. Introduction
      1. Prerequisites
    2. Labels
      1. Exercises
    3. Annotations
      1. Exercises
    4. Scales
      1. Default Scales
      2. Axis Ticks and Legend Keys
      3. Legend Layout
      4. Replacing a Scale
      5. Zooming
      6. Exercises
    5. Themes
      1. Exercises
    6. Layout
      1. Exercises
    7. Summary
  15. III. Transform
  16. 12. Logical Vectors
    1. Introduction
      1. Prerequisites
    2. Comparisons
      1. Floating-Point Comparison
      2. Missing Values
      3. is.na()
      4. Exercises
    3. Boolean Algebra
      1. Missing Values
      2. Order of Operations
      3. %in%
      4. Exercises
    4. Summaries
      1. Logical Summaries
      2. Numeric Summaries of Logical Vectors
      3. Logical Subsetting
      4. Exercises
    5. Conditional Transformations
      1. if_else()
      2. case_when()
      3. Compatible Types
      4. Exercises
    6. Summary
  17. 13. Numbers
    1. Introduction
      1. Prerequisites
    2. Making Numbers
    3. Counts
      1. Exercises
    4. Numeric Transformations
      1. Arithmetic and Recycling Rules
      2. Minimum and Maximum
      3. Modular Arithmetic
      4. Logarithms
      5. Rounding
      6. Cutting Numbers into Ranges
      7. Cumulative and Rolling Aggregates
      8. Exercises
    5. General Transformations
      1. Ranks
      2. Offsets
      3. Consecutive Identifiers
      4. Exercises
    6. Numeric Summaries
      1. Center
      2. Minimum, Maximum, and Quantiles
      3. Spread
      4. Distributions
      5. Positions
      6. With mutate()
      7. Exercises
    7. Summary
  18. 14. Strings
    1. Introduction
      1. Prerequisites
    2. Creating a String
      1. Escapes
      2. Raw Strings
      3. Other Special Characters
      4. Exercises
    3. Creating Many Strings from Data
      1. str_c()
      2. str_glue()
      3. str_flatten()
      4. Exercises
    4. Extracting Data from Strings
      1. Separating into Rows
      2. Separating into Columns
      3. Diagnosing Widening Problems
    5. Letters
      1. Length
      2. Subsetting
      3. Exercises
    6. Non-English Text
      1. Encoding
      2. Letter Variations
      3. Locale-Dependent Functions
    7. Summary
  19. 15. Regular Expressions
    1. Introduction
      1. Prerequisites
    2. Pattern Basics
    3. Key Functions
      1. Detect Matches
      2. Count Matches
      3. Replace Values
      4. Extract Variables
      5. Exercises
    4. Pattern Details
      1. Escaping
      2. Anchors
      3. Character Classes
      4. Quantifiers
      5. Operator Precedence and Parentheses
      6. Grouping and Capturing
      7. Exercises
    5. Pattern Control
      1. Regex Flags
      2. Fixed Matches
    6. Practice
      1. Check Your Work
      2. Boolean Operations
      3. Creating a Pattern with Code
      4. Exercises
    7. Regular Expressions in Other Places
      1. Tidyverse
      2. Base R
    8. Summary
  20. 16. Factors
    1. Introduction
      1. Prerequisites
    2. Factor Basics
    3. General Social Survey
      1. Exercise
    4. Modifying Factor Order
      1. Exercises
    5. Modifying Factor Levels
      1. Exercises
    6. Ordered Factors
    7. Summary
  21. 17. Dates and Times
    1. Introduction
      1. Prerequisites
    2. Creating Date/Times
      1. During Import
      2. From Strings
      3. From Individual Components
      4. From Other Types
      5. Exercises
    3. Date-Time Components
      1. Getting Components
      2. Rounding
      3. Modifying Components
      4. Exercises
    4. Time Spans
      1. Durations
      2. Periods
      3. Intervals
      4. Exercises
    5. Time Zones
    6. Summary
  22. 18. Missing Values
    1. Introduction
      1. Prerequisites
    2. Explicit Missing Values
      1. Last Observation Carried Forward
      2. Fixed Values
      3. NaN
    3. Implicit Missing Values
      1. Pivoting
      2. Complete
      3. Joins
      4. Exercises
    4. Factors and Empty Groups
    5. Summary
  23. 19. Joins
    1. Introduction
      1. Prerequisites
    2. Keys
      1. Primary and Foreign Keys
      2. Checking Primary Keys
      3. Surrogate Keys
      4. Exercises
    3. Basic Joins
      1. Mutating Joins
      2. Specifying Join Keys
      3. Filtering Joins
      4. Exercises
    4. How Do Joins Work?
      1. Row Matching
      2. Filtering Joins
    5. Non-Equi Joins
      1. Cross Joins
      2. Inequality Joins
      3. Rolling Joins
      4. Overlap Joins
      5. Exercises
    6. Summary
  24. IV. Import
  25. 20. Spreadsheets
    1. Introduction
    2. Excel
      1. Prerequisites
      2. Getting Started
      3. Reading Excel Spreadsheets
      4. Reading Worksheets
      5. Reading Part of a Sheet
      6. Data Types
      7. Writing to Excel
      8. Formatted Output
      9. Exercises
    3. Google Sheets
      1. Prerequisites
      2. Getting Started
      3. Reading Google Sheets
      4. Writing to Google Sheets
      5. Authentication
      6. Exercises
    4. Summary
  26. 21. Databases
    1. Introduction
      1. Prerequisites
    2. Database Basics
    3. Connecting to a Database
      1. In This Book
      2. Load Some Data
      3. DBI Basics
    4. dbplyr Basics
    5. SQL
      1. SQL Basics
      2. SELECT
      3. FROM
      4. GROUP BY
      5. WHERE
      6. ORDER BY
      7. Subqueries
      8. Joins
      9. Other Verbs
      10. Exercises
    6. Function Translations
    7. Summary
  27. 22. Arrow
    1. Introduction
      1. Prerequisites
    2. Getting the Data
    3. Opening a Dataset
    4. The Parquet Format
      1. Advantages of Parquet
      2. Partitioning
      3. Rewriting the Seattle Library Data
    5. Using dplyr with Arrow
      1. Performance
      2. Using dbplyr with Arrow
    6. Summary
  28. 23. Hierarchical Data
    1. Introduction
      1. Prerequisites
    2. Lists
      1. Hierarchy
      2. List Columns
    3. Unnesting
      1. unnest_wider()
      2. unnest_longer()
      3. Inconsistent Types
      4. Other Functions
      5. Exercises
    4. Case Studies
      1. Very Wide Data
      2. Relational Data
      3. Deeply Nested
      4. Exercises
    5. JSON
      1. Data Types
      2. jsonlite
      3. Starting the Rectangling Process
      4. Exercises
    6. Summary
  29. 24. Web Scraping
    1. Introduction
      1. Prerequisites
    2. Scraping Ethics and Legalities
      1. Terms of Service
      2. Personally Identifiable Information
      3. Copyright
    3. HTML Basics
      1. Elements
      2. Attributes
    4. Extracting Data
      1. Find Elements
      2. Nesting Selections
      3. Text and Attributes
      4. Tables
    5. Finding the Right Selectors
    6. Putting It All Together
      1. Star Wars
      2. IMDb Top Films
    7. Dynamic Sites
    8. Summary
  30. V. Program
  31. 25. Functions
    1. Introduction
      1. Prerequisites
    2. Vector Functions
      1. Writing a Function
      2. Improving Our Function
      3. Mutate Functions
      4. Summary Functions
      5. Exercises
    3. Data Frame Functions
      1. Indirection and Tidy Evaluation
      2. When to Embrace?
      3. Common Use Cases
      4. Data Masking Versus Tidy Selection
      5. Exercises
    4. Plot Functions
      1. More Variables
      2. Combining with Other Tidyverse Packages
      3. Labeling
      4. Exercises
    5. Style
      1. Exercises
    6. Summary
  32. 26. Iteration
    1. Introduction
      1. Prerequisites
    2. Modifying Multiple Columns
      1. Selecting Columns with .cols
      2. Calling a Single Function
      3. Calling Multiple Functions
      4. Column Names
      5. Filtering
      6. across() in Functions
      7. Versus pivot_longer()
      8. Exercises
    3. Reading Multiple Files
      1. Listing Files in a Directory
      2. Lists
      3. purrr::map() and list_rbind()
      4. Data in the Path
      5. Save Your Work
      6. Many Simple Iterations
      7. Heterogeneous Data
      8. Handling Failures
    4. Saving Multiple Outputs
      1. Writing to a Database
      2. Writing CSV Files
      3. Saving Plots
    5. Summary
  33. 27. A Field Guide to Base R
    1. Introduction
      1. Prerequisites
    2. Selecting Multiple Elements with [
      1. Subsetting Vectors
      2. Subsetting Data Frames
      3. dplyr Equivalents
      4. Exercises
    3. Selecting a Single Element with $ and [[
      1. Data Frames
      2. Tibbles
      3. Lists
      4. Exercises
    4. Apply Family
    5. for Loops
    6. Plots
    7. Summary
  34. VI. Communicate
  35. 28. Quarto
    1. Introduction
      1. Prerequisites
    2. Quarto Basics
      1. Exercises
    3. Visual Editor
      1. Exercises
    4. Source Editor
      1. Exercises
    5. Code Chunks
      1. Chunk Label
      2. Chunk Options
      3. Global Options
      4. Inline Code
      5. Exercises
    6. Figures
      1. Figure Sizing
      2. Other Important Options
      3. Exercises
    7. Tables
      1. Exercises
    8. Caching
      1. Exercises
    9. Troubleshooting
    10. YAML Header
      1. Self-Contained
      2. Parameters
      3. Bibliographies and Citations
    11. Workflow
    12. Summary
  36. 29. Quarto Formats
    1. Introduction
    2. Output Options
    3. Documents
    4. Presentations
    5. Interactivity
      1. htmlwidgets
      2. Shiny
    6. Websites and Books
    7. Other Formats
    8. Summary
  37. Index
  38. About the Authors

Product information

  • Title: R for Data Science, 2nd Edition
  • Author(s): Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund
  • Release date: June 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492097402