R for Data Science, 2nd Edition

by Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund

June 2023

Beginner to intermediate

576 pages

12h 57m

English

O'Reilly Media, Inc.

Introduction
Preface to the Second Edition; What You Will Learn; How This Book Is Organized; What You Won’t Learn; Modeling; Big Data; Python, Julia, and Friends; Prerequisites; R; RStudio; The Tidyverse; Other Packages; Running R Code; Other Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Online Edition
I. Whole Game
1. Data Visualization
Introduction; Prerequisites; First Steps; The penguins Data Frame; Ultimate Goal; Creating a ggplot; Adding Aesthetics and Layers; Exercises; ggplot2 Calls; Visualizing Distributions; A Categorical Variable; A Numerical Variable; Exercises; Visualizing Relationships; A Numerical and a Categorical Variable; Two Categorical Variables; Two Numerical Variables; Three or More Variables; Exercises; Saving Your Plots; Exercises; Common Problems; Summary
2. Workflow: Basics
Coding Basics; Comments; What’s in a Name?; Calling Functions; Exercises; Summary
3. Data Transformation
Introduction; Prerequisites; nycflights13; dplyr Basics; Rows; filter(); Common Mistakes; arrange(); distinct(); Exercises; Columns; mutate(); select(); rename(); relocate(); Exercises; The Pipe; Groups; group_by(); summarize(); The slice_ Functions; Grouping by Multiple Variables; Ungrouping; .by; Exercises; Case Study: Aggregates and Sample Size; Summary
4. Workflow: Code Style
Names; Spaces; Pipes; ggplot2; Sectioning Comments; Exercises; Summary
5. Data Tidying
Introduction; Prerequisites; Tidy Data; Exercises; Lengthening Data; Data in Column Names; How Does Pivoting Work?; Many Variables in Column Names; Data and Variable Names in the Column Headers; Widening Data; How Does pivot_wider() Work?; Summary
6. Workflow: Scripts and Projects
Scripts; Running Code; RStudio Diagnostics; Saving and Naming; Projects; What Is the Source of Truth?; Where Does Your Analysis Live?; RStudio Projects; Relative and Absolute Paths; Exercises; Summary
7. Data Import
Introduction; Prerequisites; Reading Data from a File; Practical Advice; Other Arguments; Other File Types; Exercises; Controlling Column Types; Guessing Types; Missing Values, Column Types, and Problems; Column Types; Reading Data from Multiple Files; Writing to a File; Data Entry; Summary
8. Workflow: Getting Help
Google Is Your Friend; Making a reprex; Investing in Yourself; Summary

II. Visualize
9. Layers
Introduction; Prerequisites; Aesthetic Mappings; Exercises; Geometric Objects; Exercises; Facets; Exercises; Statistical Transformations; Exercises; Position Adjustments; Exercises; Coordinate Systems; Exercises; The Layered Grammar of Graphics; Summary
10. Exploratory Data Analysis
Introduction; Prerequisites; Questions; Variation; Typical Values; Unusual Values; Exercises; Unusual Values; Exercises; Covariation; A Categorical and a Numerical Variable; Two Categorical Variables; Two Numerical Variables; Patterns and Models; Summary
11. Communication
Introduction; Prerequisites; Labels; Exercises; Annotations; Exercises; Scales; Default Scales; Axis Ticks and Legend Keys; Legend Layout; Replacing a Scale; Zooming; Exercises; Themes; Exercises; Layout; Exercises; Summary
III. Transform
12. Logical Vectors
Introduction; Prerequisites; Comparisons; Floating-Point Comparison; Missing Values; is.na(); Exercises; Boolean Algebra; Missing Values; Order of Operations; %in%; Exercises; Summaries; Logical Summaries; Numeric Summaries of Logical Vectors; Logical Subsetting; Exercises; Conditional Transformations; if_else(); case_when(); Compatible Types; Exercises; Summary
13. Numbers
Introduction; Prerequisites; Making Numbers; Counts; Exercises; Numeric Transformations; Arithmetic and Recycling Rules; Minimum and Maximum; Modular Arithmetic; Logarithms; Rounding; Cutting Numbers into Ranges; Cumulative and Rolling Aggregates; Exercises; General Transformations; Ranks; Offsets; Consecutive Identifiers; Exercises; Numeric Summaries; Center; Minimum, Maximum, and Quantiles; Spread; Distributions; Positions; With mutate(); Exercises; Summary
14. Strings
Introduction; Prerequisites; Creating a String; Escapes; Raw Strings; Other Special Characters; Exercises; Creating Many Strings from Data; str_c(); str_glue(); str_flatten(); Exercises; Extracting Data from Strings; Separating into Rows; Separating into Columns; Diagnosing Widening Problems; Letters; Length; Subsetting; Exercises; Non-English Text; Encoding; Letter Variations; Locale-Dependent Functions; Summary
15. Regular Expressions
Introduction; Prerequisites; Pattern Basics; Key Functions; Detect Matches; Count Matches; Replace Values; Extract Variables; Exercises; Pattern Details; Escaping; Anchors; Character Classes; Quantifiers; Operator Precedence and Parentheses; Grouping and Capturing; Exercises; Pattern Control; Regex Flags; Fixed Matches; Practice; Check Your Work; Boolean Operations; Creating a Pattern with Code; Exercises; Regular Expressions in Other Places; Tidyverse; Base R; Summary
16. Factors
Introduction; Prerequisites; Factor Basics; General Social Survey; Exercise; Modifying Factor Order; Exercises; Modifying Factor Levels; Exercises; Ordered Factors; Summary
17. Dates and Times
Introduction; Prerequisites; Creating Date/Times; During Import; From Strings; From Individual Components; From Other Types; Exercises; Date-Time Components; Getting Components; Rounding; Modifying Components; Exercises; Time Spans; Durations; Periods; Intervals; Exercises; Time Zones; Summary
18. Missing Values
Introduction; Prerequisites; Explicit Missing Values; Last Observation Carried Forward; Fixed Values; NaN; Implicit Missing Values; Pivoting; Complete; Joins; Exercises; Factors and Empty Groups; Summary
19. Joins
Introduction; Prerequisites; Keys; Primary and Foreign Keys; Checking Primary Keys; Surrogate Keys; Exercises; Basic Joins; Mutating Joins; Specifying Join Keys; Filtering Joins; Exercises; How Do Joins Work?; Row Matching; Filtering Joins; Non-Equi Joins; Cross Joins; Inequality Joins; Rolling Joins; Overlap Joins; Exercises; Summary
IV. Import
20. Spreadsheets
Introduction; Excel; Prerequisites; Getting Started; Reading Excel Spreadsheets; Reading Worksheets; Reading Part of a Sheet; Data Types; Writing to Excel; Formatted Output; Exercises; Google Sheets; Prerequisites; Getting Started; Reading Google Sheets; Writing to Google Sheets; Authentication; Exercises; Summary
21. Databases
Introduction; Prerequisites; Database Basics; Connecting to a Database; In This Book; Load Some Data; DBI Basics; dbplyr Basics; SQL; SQL Basics; SELECT; FROM; GROUP BY; WHERE; ORDER BY; Subqueries; Joins; Other Verbs; Exercises; Function Translations; Summary
22. Arrow
Introduction; Prerequisites; Getting the Data; Opening a Dataset; The Parquet Format; Advantages of Parquet; Partitioning; Rewriting the Seattle Library Data; Using dplyr with Arrow; Performance; Using dbplyr with Arrow; Summary
23. Hierarchical Data
Introduction; Prerequisites; Lists; Hierarchy; List Columns; Unnesting; unnest_wider(); unnest_longer(); Inconsistent Types; Other Functions; Exercises; Case Studies; Very Wide Data; Relational Data; Deeply Nested; Exercises; JSON; Data Types; jsonlite; Starting the Rectangling Process; Exercises; Summary
24. Web Scraping
Introduction; Prerequisites; Scraping Ethics and Legalities; Terms of Service; Personally Identifiable Information; Copyright; HTML Basics; Elements; Attributes; Extracting Data; Find Elements; Nesting Selections; Text and Attributes; Tables; Finding the Right Selectors; Putting It All Together; Star Wars; IMDb Top Films; Dynamic Sites; Summary
V. Program
25. Functions
Introduction; Prerequisites; Vector Functions; Writing a Function; Improving Our Function; Mutate Functions; Summary Functions; Exercises; Data Frame Functions; Indirection and Tidy Evaluation; When to Embrace?; Common Use Cases; Data Masking Versus Tidy Selection; Exercises; Plot Functions; More Variables; Combining with Other Tidyverse Packages; Labeling; Exercises; Style; Exercises; Summary
26. Iteration
Introduction; Prerequisites; Modifying Multiple Columns; Selecting Columns with .cols; Calling a Single Function; Calling Multiple Functions; Column Names; Filtering; across() in Functions; Versus pivot_longer(); Exercises; Reading Multiple Files; Listing Files in a Directory; Lists; purrr::map() and list_rbind(); Data in the Path; Save Your Work; Many Simple Iterations; Heterogeneous Data; Handling Failures; Saving Multiple Outputs; Writing to a Database; Writing CSV Files; Saving Plots; Summary
27. A Field Guide to Base R
Introduction; Prerequisites; Selecting Multiple Elements with [; Subsetting Vectors; Subsetting Data Frames; dplyr Equivalents; Exercises; Selecting a Single Element with $ and [[; Data Frames; Tibbles; Lists; Exercises; Apply Family; for Loops; Plots; Summary
VI. Communicate
28. Quarto
Introduction; Prerequisites; Quarto Basics; Exercises; Visual Editor; Exercises; Source Editor; Exercises; Code Chunks; Chunk Label; Chunk Options; Global Options; Inline Code; Exercises; Figures; Figure Sizing; Other Important Options; Exercises; Tables; Exercises; Caching; Exercises; Troubleshooting; YAML Header; Self-Contained; Parameters; Bibliographies and Citations; Workflow; Summary
29. Quarto Formats
Introduction; Output Options; Documents; Presentations; Interactivity; htmlwidgets; Shiny; Websites and Books; Other Formats; Summary
Index
About the Authors

Content preview from R for Data Science, 2nd Edition

Introduction

Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge. The goals of R for Data Science are to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly and to have some fun along the way! After reading this book, you’ll have the tools to tackle a wide variety of data science challenges using the best parts of R.

Preface to the Second Edition

Welcome to the second edition of R for Data Science (R4DS)! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).

A brief summary of the biggest changes follows:

The first part of the book has been renamed to “Whole Game.” The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.
The second part of the book is “Visualize.” This part gives data visualization tools and best practices a more thorough coverage compared to the first edition. The best place to get all the details is still the ggplot2 book, but now R4DS covers more of the most important techniques.
The third part of the ...

Publisher Resources

ISBN: 9781492097396
