R for Data Science, 2nd Edition

by Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund

June 2023

Beginner to intermediate

576 pages

12h 57m

English

O'Reilly Media, Inc.

Introduction
Preface to the Second Edition; What You Will Learn; How This Book Is Organized; What You Won’t Learn; Modeling; Big Data; Python, Julia, and Friends; Prerequisites; R; RStudio; The Tidyverse; Other Packages; Running R Code; Other Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Online Edition
I. Whole Game
1. Data Visualization
Introduction; Prerequisites; First Steps; The penguins Data Frame; Ultimate Goal; Creating a ggplot; Adding Aesthetics and Layers; Exercises; ggplot2 Calls; Visualizing Distributions; A Categorical Variable; A Numerical Variable; Exercises; Visualizing Relationships; A Numerical and a Categorical Variable; Two Categorical Variables; Two Numerical Variables; Three or More Variables; Exercises; Saving Your Plots; Exercises; Common Problems; Summary
2. Workflow: Basics
Coding Basics; Comments; What’s in a Name?; Calling Functions; Exercises; Summary
3. Data Transformation
Introduction; Prerequisites; nycflights13; dplyr Basics; Rows; filter(); Common Mistakes; arrange(); distinct(); Exercises; Columns; mutate(); select(); rename(); relocate(); Exercises; The Pipe; Groups; group_by(); summarize(); The slice_ Functions; Grouping by Multiple Variables; Ungrouping; .by; Exercises; Case Study: Aggregates and Sample Size; Summary
4. Workflow: Code Style
Names; Spaces; Pipes; ggplot2; Sectioning Comments; Exercises; Summary
5. Data Tidying
Introduction; Prerequisites; Tidy Data; Exercises; Lengthening Data; Data in Column Names; How Does Pivoting Work?; Many Variables in Column Names; Data and Variable Names in the Column Headers; Widening Data; How Does pivot_wider() Work?; Summary
6. Workflow: Scripts and Projects
Scripts; Running Code; RStudio Diagnostics; Saving and Naming; Projects; What Is the Source of Truth?; Where Does Your Analysis Live?; RStudio Projects; Relative and Absolute Paths; Exercises; Summary
7. Data Import
Introduction; Prerequisites; Reading Data from a File; Practical Advice; Other Arguments; Other File Types; Exercises; Controlling Column Types; Guessing Types; Missing Values, Column Types, and Problems; Column Types; Reading Data from Multiple Files; Writing to a File; Data Entry; Summary
8. Workflow: Getting Help
Google Is Your Friend; Making a reprex; Investing in Yourself; Summary

II. Visualize
9. Layers
Introduction; Prerequisites; Aesthetic Mappings; Exercises; Geometric Objects; Exercises; Facets; Exercises; Statistical Transformations; Exercises; Position Adjustments; Exercises; Coordinate Systems; Exercises; The Layered Grammar of Graphics; Summary
10. Exploratory Data Analysis
Introduction; Prerequisites; Questions; Variation; Typical Values; Unusual Values; Exercises; Unusual Values; Exercises; Covariation; A Categorical and a Numerical Variable; Two Categorical Variables; Two Numerical Variables; Patterns and Models; Summary
11. Communication
Introduction; Prerequisites; Labels; Exercises; Annotations; Exercises; Scales; Default Scales; Axis Ticks and Legend Keys; Legend Layout; Replacing a Scale; Zooming; Exercises; Themes; Exercises; Layout; Exercises; Summary
III. Transform
12. Logical Vectors
Introduction; Prerequisites; Comparisons; Floating-Point Comparison; Missing Values; is.na(); Exercises; Boolean Algebra; Missing Values; Order of Operations; %in%; Exercises; Summaries; Logical Summaries; Numeric Summaries of Logical Vectors; Logical Subsetting; Exercises; Conditional Transformations; if_else(); case_when(); Compatible Types; Exercises; Summary
13. Numbers
Introduction; Prerequisites; Making Numbers; Counts; Exercises; Numeric Transformations; Arithmetic and Recycling Rules; Minimum and Maximum; Modular Arithmetic; Logarithms; Rounding; Cutting Numbers into Ranges; Cumulative and Rolling Aggregates; Exercises; General Transformations; Ranks; Offsets; Consecutive Identifiers; Exercises; Numeric Summaries; Center; Minimum, Maximum, and Quantiles; Spread; Distributions; Positions; With mutate(); Exercises; Summary
14. Strings
Introduction; Prerequisites; Creating a String; Escapes; Raw Strings; Other Special Characters; Exercises; Creating Many Strings from Data; str_c(); str_glue(); str_flatten(); Exercises; Extracting Data from Strings; Separating into Rows; Separating into Columns; Diagnosing Widening Problems; Letters; Length; Subsetting; Exercises; Non-English Text; Encoding; Letter Variations; Locale-Dependent Functions; Summary
15. Regular Expressions
Introduction; Prerequisites; Pattern Basics; Key Functions; Detect Matches; Count Matches; Replace Values; Extract Variables; Exercises; Pattern Details; Escaping; Anchors; Character Classes; Quantifiers; Operator Precedence and Parentheses; Grouping and Capturing; Exercises; Pattern Control; Regex Flags; Fixed Matches; Practice; Check Your Work; Boolean Operations; Creating a Pattern with Code; Exercises; Regular Expressions in Other Places; Tidyverse; Base R; Summary
16. Factors
Introduction; Prerequisites; Factor Basics; General Social Survey; Exercise; Modifying Factor Order; Exercises; Modifying Factor Levels; Exercises; Ordered Factors; Summary
17. Dates and Times
Introduction; Prerequisites; Creating Date/Times; During Import; From Strings; From Individual Components; From Other Types; Exercises; Date-Time Components; Getting Components; Rounding; Modifying Components; Exercises; Time Spans; Durations; Periods; Intervals; Exercises; Time Zones; Summary
18. Missing Values
Introduction; Prerequisites; Explicit Missing Values; Last Observation Carried Forward; Fixed Values; NaN; Implicit Missing Values; Pivoting; Complete; Joins; Exercises; Factors and Empty Groups; Summary
19. Joins
Introduction; Prerequisites; Keys; Primary and Foreign Keys; Checking Primary Keys; Surrogate Keys; Exercises; Basic Joins; Mutating Joins; Specifying Join Keys; Filtering Joins; Exercises; How Do Joins Work?; Row Matching; Filtering Joins; Non-Equi Joins; Cross Joins; Inequality Joins; Rolling Joins; Overlap Joins; Exercises; Summary
IV. Import
20. Spreadsheets
Introduction; Excel; Prerequisites; Getting Started; Reading Excel Spreadsheets; Reading Worksheets; Reading Part of a Sheet; Data Types; Writing to Excel; Formatted Output; Exercises; Google Sheets; Prerequisites; Getting Started; Reading Google Sheets; Writing to Google Sheets; Authentication; Exercises; Summary
21. Databases
Introduction; Prerequisites; Database Basics; Connecting to a Database; In This Book; Load Some Data; DBI Basics; dbplyr Basics; SQL; SQL Basics; SELECT; FROM; GROUP BY; WHERE; ORDER BY; Subqueries; Joins; Other Verbs; Exercises; Function Translations; Summary
22. Arrow
Introduction; Prerequisites; Getting the Data; Opening a Dataset; The Parquet Format; Advantages of Parquet; Partitioning; Rewriting the Seattle Library Data; Using dplyr with Arrow; Performance; Using dbplyr with Arrow; Summary
23. Hierarchical Data
Introduction; Prerequisites; Lists; Hierarchy; List Columns; Unnesting; unnest_wider(); unnest_longer(); Inconsistent Types; Other Functions; Exercises; Case Studies; Very Wide Data; Relational Data; Deeply Nested; Exercises; JSON; Data Types; jsonlite; Starting the Rectangling Process; Exercises; Summary
24. Web Scraping
Introduction; Prerequisites; Scraping Ethics and Legalities; Terms of Service; Personally Identifiable Information; Copyright; HTML Basics; Elements; Attributes; Extracting Data; Find Elements; Nesting Selections; Text and Attributes; Tables; Finding the Right Selectors; Putting It All Together; Star Wars; IMDb Top Films; Dynamic Sites; Summary
V. Program
25. Functions
Introduction; Prerequisites; Vector Functions; Writing a Function; Improving Our Function; Mutate Functions; Summary Functions; Exercises; Data Frame Functions; Indirection and Tidy Evaluation; When to Embrace?; Common Use Cases; Data Masking Versus Tidy Selection; Exercises; Plot Functions; More Variables; Combining with Other Tidyverse Packages; Labeling; Exercises; Style; Exercises; Summary
26. Iteration
Introduction; Prerequisites; Modifying Multiple Columns; Selecting Columns with .cols; Calling a Single Function; Calling Multiple Functions; Column Names; Filtering; across() in Functions; Versus pivot_longer(); Exercises; Reading Multiple Files; Listing Files in a Directory; Lists; purrr::map() and list_rbind(); Data in the Path; Save Your Work; Many Simple Iterations; Heterogeneous Data; Handling Failures; Saving Multiple Outputs; Writing to a Database; Writing CSV Files; Saving Plots; Summary
27. A Field Guide to Base R
Introduction; Prerequisites; Selecting Multiple Elements with [; Subsetting Vectors; Subsetting Data Frames; dplyr Equivalents; Exercises; Selecting a Single Element with $ and [[; Data Frames; Tibbles; Lists; Exercises; Apply Family; for Loops; Plots; Summary
VI. Communicate
28. Quarto
Introduction; Prerequisites; Quarto Basics; Exercises; Visual Editor; Exercises; Source Editor; Exercises; Code Chunks; Chunk Label; Chunk Options; Global Options; Inline Code; Exercises; Figures; Figure Sizing; Other Important Options; Exercises; Tables; Exercises; Caching; Exercises; Troubleshooting; YAML Header; Self-Contained; Parameters; Bibliographies and Citations; Workflow; Summary
29. Quarto Formats
Introduction; Output Options; Documents; Presentations; Interactivity; htmlwidgets; Shiny; Websites and Books; Other Formats; Summary
Index
About the Authors

Content preview from R for Data Science, 2nd Edition

Part IV. Import

In this part of the book, you’ll learn how to import a wider range of data into R, as well as how to get it into a useful form for analysis. Sometimes this is just a matter of calling a function from the appropriate data import package. But in more complex cases, it might require both tidying and transformation to get to the tidy rectangle that you’d prefer to work with.

Our data science model with import highlighted in blue.

In this part of the book you’ll learn how to access data stored in the following ways:

In Chapter 20, you’ll learn how to import data from Excel spreadsheets and Google Sheets.
In Chapter 21, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).
In Chapter 22, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the Parquet format.
In Chapter 23, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.
In Chapter 24, you’ll learn web “scraping,” the art and science of extracting data from web pages.

There are two important tidyverse packages that we don’t discuss here: haven and xml2. If you are working with data from SPSS, Stata, and SAS files, check out the haven ...

Publisher Resources

ISBN: 9781492097396

Figure IV-1. Data import is the beginning of the data science process; without data you can’t do data science!
