R for Data Science, 2nd Edition

by Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund

June 2023

Beginner to intermediate

576 pages

12h 57m

English

O'Reilly Media, Inc.

Introduction
Preface to the Second Edition; What You Will Learn; How This Book Is Organized; What You Won’t Learn; Modeling; Big Data; Python, Julia, and Friends; Prerequisites; R; RStudio; The Tidyverse; Other Packages; Running R Code; Other Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Online Edition
I. Whole Game
1. Data Visualization
Introduction; Prerequisites; First Steps; The penguins Data Frame; Ultimate Goal; Creating a ggplot; Adding Aesthetics and Layers; Exercises; ggplot2 Calls; Visualizing Distributions; A Categorical Variable; A Numerical Variable; Exercises; Visualizing Relationships; A Numerical and a Categorical Variable; Two Categorical Variables; Two Numerical Variables; Three or More Variables; Exercises; Saving Your Plots; Exercises; Common Problems; Summary
2. Workflow: Basics
Coding Basics; Comments; What’s in a Name?; Calling Functions; Exercises; Summary
3. Data Transformation
Introduction; Prerequisites; nycflights13; dplyr Basics; Rows; filter(); Common Mistakes; arrange(); distinct(); Exercises; Columns; mutate(); select(); rename(); relocate(); Exercises; The Pipe; Groups; group_by(); summarize(); The slice_ Functions; Grouping by Multiple Variables; Ungrouping; .by; Exercises; Case Study: Aggregates and Sample Size; Summary
4. Workflow: Code Style
Names; Spaces; Pipes; ggplot2; Sectioning Comments; Exercises; Summary
5. Data Tidying
Introduction; Prerequisites; Tidy Data; Exercises; Lengthening Data; Data in Column Names; How Does Pivoting Work?; Many Variables in Column Names; Data and Variable Names in the Column Headers; Widening Data; How Does pivot_wider() Work?; Summary
6. Workflow: Scripts and Projects
Scripts; Running Code; RStudio Diagnostics; Saving and Naming; Projects; What Is the Source of Truth?; Where Does Your Analysis Live?; RStudio Projects; Relative and Absolute Paths; Exercises; Summary
7. Data Import
Introduction; Prerequisites; Reading Data from a File; Practical Advice; Other Arguments; Other File Types; Exercises; Controlling Column Types; Guessing Types; Missing Values, Column Types, and Problems; Column Types; Reading Data from Multiple Files; Writing to a File; Data Entry; Summary
8. Workflow: Getting Help
Google Is Your Friend; Making a reprex; Investing in Yourself; Summary

II. Visualize
9. Layers
Introduction; Prerequisites; Aesthetic Mappings; Exercises; Geometric Objects; Exercises; Facets; Exercises; Statistical Transformations; Exercises; Position Adjustments; Exercises; Coordinate Systems; Exercises; The Layered Grammar of Graphics; Summary
10. Exploratory Data Analysis
Introduction; Prerequisites; Questions; Variation; Typical Values; Unusual Values; Exercises; Unusual Values; Exercises; Covariation; A Categorical and a Numerical Variable; Two Categorical Variables; Two Numerical Variables; Patterns and Models; Summary
11. Communication
Introduction; Prerequisites; Labels; Exercises; Annotations; Exercises; Scales; Default Scales; Axis Ticks and Legend Keys; Legend Layout; Replacing a Scale; Zooming; Exercises; Themes; Exercises; Layout; Exercises; Summary
III. Transform
12. Logical Vectors
Introduction; Prerequisites; Comparisons; Floating-Point Comparison; Missing Values; is.na(); Exercises; Boolean Algebra; Missing Values; Order of Operations; %in%; Exercises; Summaries; Logical Summaries; Numeric Summaries of Logical Vectors; Logical Subsetting; Exercises; Conditional Transformations; if_else(); case_when(); Compatible Types; Exercises; Summary
13. Numbers
Introduction; Prerequisites; Making Numbers; Counts; Exercises; Numeric Transformations; Arithmetic and Recycling Rules; Minimum and Maximum; Modular Arithmetic; Logarithms; Rounding; Cutting Numbers into Ranges; Cumulative and Rolling Aggregates; Exercises; General Transformations; Ranks; Offsets; Consecutive Identifiers; Exercises; Numeric Summaries; Center; Minimum, Maximum, and Quantiles; Spread; Distributions; Positions; With mutate(); Exercises; Summary
14. Strings
Introduction; Prerequisites; Creating a String; Escapes; Raw Strings; Other Special Characters; Exercises; Creating Many Strings from Data; str_c(); str_glue(); str_flatten(); Exercises; Extracting Data from Strings; Separating into Rows; Separating into Columns; Diagnosing Widening Problems; Letters; Length; Subsetting; Exercises; Non-English Text; Encoding; Letter Variations; Locale-Dependent Functions; Summary
15. Regular Expressions
Introduction; Prerequisites; Pattern Basics; Key Functions; Detect Matches; Count Matches; Replace Values; Extract Variables; Exercises; Pattern Details; Escaping; Anchors; Character Classes; Quantifiers; Operator Precedence and Parentheses; Grouping and Capturing; Exercises; Pattern Control; Regex Flags; Fixed Matches; Practice; Check Your Work; Boolean Operations; Creating a Pattern with Code; Exercises; Regular Expressions in Other Places; Tidyverse; Base R; Summary
16. Factors
Introduction; Prerequisites; Factor Basics; General Social Survey; Exercise; Modifying Factor Order; Exercises; Modifying Factor Levels; Exercises; Ordered Factors; Summary
17. Dates and Times
Introduction; Prerequisites; Creating Date/Times; During Import; From Strings; From Individual Components; From Other Types; Exercises; Date-Time Components; Getting Components; Rounding; Modifying Components; Exercises; Time Spans; Durations; Periods; Intervals; Exercises; Time Zones; Summary
18. Missing Values
Introduction; Prerequisites; Explicit Missing Values; Last Observation Carried Forward; Fixed Values; NaN; Implicit Missing Values; Pivoting; Complete; Joins; Exercises; Factors and Empty Groups; Summary
19. Joins
Introduction; Prerequisites; Keys; Primary and Foreign Keys; Checking Primary Keys; Surrogate Keys; Exercises; Basic Joins; Mutating Joins; Specifying Join Keys; Filtering Joins; Exercises; How Do Joins Work?; Row Matching; Filtering Joins; Non-Equi Joins; Cross Joins; Inequality Joins; Rolling Joins; Overlap Joins; Exercises; Summary
IV. Import
20. Spreadsheets
Introduction; Excel; Prerequisites; Getting Started; Reading Excel Spreadsheets; Reading Worksheets; Reading Part of a Sheet; Data Types; Writing to Excel; Formatted Output; Exercises; Google Sheets; Prerequisites; Getting Started; Reading Google Sheets; Writing to Google Sheets; Authentication; Exercises; Summary
21. Databases
Introduction; Prerequisites; Database Basics; Connecting to a Database; In This Book; Load Some Data; DBI Basics; dbplyr Basics; SQL; SQL Basics; SELECT; FROM; GROUP BY; WHERE; ORDER BY; Subqueries; Joins; Other Verbs; Exercises; Function Translations; Summary
22. Arrow
Introduction; Prerequisites; Getting the Data; Opening a Dataset; The Parquet Format; Advantages of Parquet; Partitioning; Rewriting the Seattle Library Data; Using dplyr with Arrow; Performance; Using dbplyr with Arrow; Summary
23. Hierarchical Data
Introduction; Prerequisites; Lists; Hierarchy; List Columns; Unnesting; unnest_wider(); unnest_longer(); Inconsistent Types; Other Functions; Exercises; Case Studies; Very Wide Data; Relational Data; Deeply Nested; Exercises; JSON; Data Types; jsonlite; Starting the Rectangling Process; Exercises; Summary
24. Web Scraping
Introduction; Prerequisites; Scraping Ethics and Legalities; Terms of Service; Personally Identifiable Information; Copyright; HTML Basics; Elements; Attributes; Extracting Data; Find Elements; Nesting Selections; Text and Attributes; Tables; Finding the Right Selectors; Putting It All Together; Star Wars; IMDb Top Films; Dynamic Sites; Summary
V. Program
25. Functions
Introduction; Prerequisites; Vector Functions; Writing a Function; Improving Our Function; Mutate Functions; Summary Functions; Exercises; Data Frame Functions; Indirection and Tidy Evaluation; When to Embrace?; Common Use Cases; Data Masking Versus Tidy Selection; Exercises; Plot Functions; More Variables; Combining with Other Tidyverse Packages; Labeling; Exercises; Style; Exercises; Summary
26. Iteration
Introduction; Prerequisites; Modifying Multiple Columns; Selecting Columns with .cols; Calling a Single Function; Calling Multiple Functions; Column Names; Filtering; across() in Functions; Versus pivot_longer(); Exercises; Reading Multiple Files; Listing Files in a Directory; Lists; purrr::map() and list_rbind(); Data in the Path; Save Your Work; Many Simple Iterations; Heterogeneous Data; Handling Failures; Saving Multiple Outputs; Writing to a Database; Writing CSV Files; Saving Plots; Summary
27. A Field Guide to Base R
Introduction; Prerequisites; Selecting Multiple Elements with [; Subsetting Vectors; Subsetting Data Frames; dplyr Equivalents; Exercises; Selecting a Single Element with $ and [[; Data Frames; Tibbles; Lists; Exercises; Apply Family; for Loops; Plots; Summary
VI. Communicate
28. Quarto
Introduction; Prerequisites; Quarto Basics; Exercises; Visual Editor; Exercises; Source Editor; Exercises; Code Chunks; Chunk Label; Chunk Options; Global Options; Inline Code; Exercises; Figures; Figure Sizing; Other Important Options; Exercises; Tables; Exercises; Caching; Exercises; Troubleshooting; YAML Header; Self-Contained; Parameters; Bibliographies and Citations; Workflow; Summary
29. Quarto Formats
Introduction; Output Options; Documents; Presentations; Interactivity; htmlwidgets; Shiny; Websites and Books; Other Formats; Summary
Index
About the Authors

Content preview from R for Data Science, 2nd Edition

Chapter 15. Regular Expressions

Introduction

In Chapter 14, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term regular expression is a bit of a mouthful, so most people abbreviate it to regex or regexp.

The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.
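As a rough preview of the seven pattern topics listed above, here is a minimal sketch using stringr. The vector x and the patterns are made up for illustration and are not taken from the chapter itself.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
x <- c("a.c", "abc", "apple", "banana", "2024-01-15")

# Escaping: "\\." matches a literal dot, while a bare "." matches any character
escaped   <- str_detect(x, "a\\.c")
unescaped <- str_detect(x, "a.c")

# Anchors: "^" ties the match to the start of the string, "$" to the end
starts_with_a <- str_detect(x, "^a")

# Character classes: "[aeiou]" matches any single vowel
ends_in_vowel <- str_detect(x, "[aeiou]$")

# Shorthand classes and quantifiers: "\\d" is a digit, "{4}" means exactly four
looks_like_date <- str_detect(x, "^\\d{4}-\\d{2}-\\d{2}$")

# Precedence: parentheses limit the alternation to the single letter
gray_or_grey <- str_detect(c("gray", "grey", "griy"), "gr(a|e)y")

# Grouping: captured groups can be reused in the replacement as \\1, \\2, \\3
reordered <- str_replace("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
```

Each result is a logical vector (or, for the last call, a string) that you can print to see which elements matched.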

Prerequisites

In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package:

library(tidyverse)
library(babynames)

Throughout this chapter, we’ll use a mix of simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:

fruit contains the names of 80 fruits.
words contains 980 common English words.
sentences contains 720 short sentences.
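A quick way to get a feel for these three vectors is to check their lengths and try a first pattern against one of them. This is only a sketch, and the variable name berries is an arbitrary choice for illustration.

```r
library(stringr)

# The three character vectors ship with stringr itself
n_fruit     <- length(fruit)
n_words     <- length(words)
n_sentences <- length(sentences)

# A first taste of pattern matching: keep the fruit names that contain "berry"
berries <- str_subset(fruit, "berry")
head(berries)
```

str_subset() returns the matching elements themselves, whereas str_detect() would return a logical vector of the same length as fruit.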

Pattern Basics

We’ll use str_view() to learn how ...

Publisher Resources

ISBN: 9781492097396
