Practical Data Science with R, Second Edition

Book description

Practical Data Science with R, Second Edition takes a practice-oriented approach to explaining basic principles in the ever-expanding field of data science. You’ll jump right to real-world use cases as you apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.



About the Technology

Evidence-based decisions are crucial to success. Applying the right data analysis techniques to your carefully curated business data helps you make accurate predictions, identify trends, and spot trouble in advance. The R data analysis platform provides the tools you need to tackle day-to-day data analysis and machine learning tasks efficiently and effectively.



About the Book

Practical Data Science with R, Second Edition is a task-based tutorial that leads readers through dozens of useful data analysis practices using the R language. By concentrating on the most important tasks you’ll face on the job, this friendly guide is comfortable for both business analysts and data scientists. Because data is only useful if it can be understood, you’ll also find fantastic tips for organizing and presenting data in tables, as well as snappy visualizations.



What's Inside

  • Statistical analysis for business pros
  • Effective data presentation
  • The most useful R tools
  • Interpreting complicated predictive models


About the Reader

You’ll need to be comfortable with basic statistics and have an introductory knowledge of R or another high-level programming language.



About the Authors

Nina Zumel and John Mount founded a San Francisco–based data science consulting firm. Both hold PhDs from Carnegie Mellon University and blog on statistics, probability, and computer science.



Quotes
Full of useful shared experience and practical advice. Highly recommended.
- From the Foreword by Jeremy Howard and Rachel Thomas

Great examples and an informative walk-through of the data science process.
- David Meza, NASA

Offers interesting perspectives that cover many aspects of practical data science; a good reference.
- Pascal Barbedor, BL SET

R you ready to get data science done the right way?
- Taylor Dolezal, Disney Studios

Table of contents

  1. Inside front cover
  2. Practical Data Science with R, Second Edition
  3. Copyright
  4. Dedication
  5. Brief Table of Contents
  6. Table of Contents
  7. Praise for the First Edition
  8. front matter
    1. Foreword
    2. Preface
    3. Acknowledgments
    4. About This Book
      1. What is data science?
      2. Roadmap
      3. Audience
      4. What is not in this book?
      5. Code conventions and downloads
      6. Working with this book
      7. Book forum
    5. About the Authors
    6. About the Foreword Authors
    7. About the Cover Illustration
  9. Part 1. Introduction to data science
  10. 1 The data science process
    1. 1.1. The roles in a data science project
      1. 1.1.1. Project roles
    2. 1.2. Stages of a data science project
      1. 1.2.1. Defining the goal
      2. 1.2.2. Data collection and management
      3. 1.2.3. Modeling
      4. 1.2.4. Model evaluation and critique
      5. 1.2.5. Presentation and documentation
      6. 1.2.6. Model deployment and maintenance
    3. 1.3. Setting expectations
      1. 1.3.1. Determining lower bounds on model performance
    4. Summary
  11. 2 Starting with R and data
    1. 2.1. Starting with R
      1. 2.1.1. Installing R, tools, and examples
      2. 2.1.2. R programming
    2. 2.2. Working with data from files
      1. 2.2.1. Working with well-structured data from files or URLs
      2. 2.2.2. Using R with less-structured data
    3. 2.3. Working with relational databases
      1. 2.3.1. A production-size example
    4. Summary
  12. 3 Exploring data
    1. 3.1. Using summary statistics to spot problems
      1. 3.1.1. Typical problems revealed by data summaries
    2. 3.2. Spotting problems using graphics and visualization
      1. 3.2.1. Visually checking distributions for a single variable
      2. 3.2.2. Visually checking relationships between two variables
    3. Summary
  13. 4 Managing data
    1. 4.1. Cleaning data
      1. 4.1.1. Domain-specific data cleaning
      2. 4.1.2. Treating missing values
      3. 4.1.3. The vtreat package for automatically treating missing variables
    2. 4.2. Data transformations
      1. 4.2.1. Normalization
      2. 4.2.2. Centering and scaling
      3. 4.2.3. Log transformations for skewed and wide distributions
    3. 4.3. Sampling for modeling and validation
      1. 4.3.1. Test and training splits
      2. 4.3.2. Creating a sample group column
      3. 4.3.3. Record grouping
      4. 4.3.4. Data provenance
    4. Summary
  14. 5 Data engineering and data shaping
    1. 5.1. Data selection
      1. 5.1.1. Subsetting rows and columns
      2. 5.1.2. Removing records with incomplete data
      3. 5.1.3. Ordering rows
    2. 5.2. Basic data transforms
      1. 5.2.1. Adding new columns
      2. 5.2.2. Other simple operations
    3. 5.3. Aggregating transforms
      1. 5.3.1. Combining many rows into summary rows
    4. 5.4. Multitable data transforms
      1. 5.4.1. Combining two or more ordered data frames quickly
      2. 5.4.2. Principal methods to combine data from multiple tables
    5. 5.5. Reshaping transforms
      1. 5.5.1. Moving data from wide to tall form
      2. 5.5.2. Moving data from tall to wide form
      3. 5.5.3. Data coordinates
    6. Summary
  15. Part 2. Modeling methods
  16. 6 Choosing and evaluating models
    1. 6.1. Mapping problems to machine learning tasks
      1. 6.1.1. Classification problems
      2. 6.1.2. Scoring problems
      3. 6.1.3. Grouping: working without known targets
      4. 6.1.4. Problem-to-method mapping
    2. 6.2. Evaluating models
      1. 6.2.1. Overfitting
      2. 6.2.2. Measures of model performance
      3. 6.2.3. Evaluating classification models
      4. 6.2.4. Evaluating scoring models
      5. 6.2.5. Evaluating probability models
    3. 6.3. Local interpretable model-agnostic explanations (LIME) for explaining model predictions
      1. 6.3.1. LIME: Automated sanity checking
      2. 6.3.2. Walking through LIME: A small example
      3. 6.3.3. LIME for text classification
      4. 6.3.4. Training the text classifier
      5. 6.3.5. Explaining the classifier’s predictions
    4. Summary
  17. 7 Linear and logistic regression
    1. 7.1. Using linear regression
      1. 7.1.1. Understanding linear regression
      2. 7.1.2. Building a linear regression model
      3. 7.1.3. Making predictions
      4. 7.1.4. Finding relations and extracting advice
      5. 7.1.5. Reading the model summary and characterizing coefficient quality
      6. 7.1.6. Linear regression takeaways
    2. 7.2. Using logistic regression
      1. 7.2.1. Understanding logistic regression
      2. 7.2.2. Building a logistic regression model
      3. 7.2.3. Making predictions
      4. 7.2.4. Finding relations and extracting advice from logistic models
      5. 7.2.5. Reading the model summary and characterizing coefficients
      6. 7.2.6. Logistic regression takeaways
    3. 7.3. Regularization
      1. 7.3.1. An example of quasi-separation
      2. 7.3.2. The types of regularized regression
      3. 7.3.3. Regularized regression with glmnet
    4. Summary
  18. 8 Advanced data preparation
    1. 8.1. The purpose of the vtreat package
    2. 8.2. KDD and KDD Cup 2009
      1. 8.2.1. Getting started with KDD Cup 2009 data
      2. 8.2.2. The bull-in-the-china-shop approach
    3. 8.3. Basic data preparation for classification
      1. 8.3.1. The variable score frame
    4. 8.4. Advanced data preparation for classification
      1. 8.4.1. Using mkCrossFrameCExperiment()
      2. 8.4.2. Building a model
    5. 8.5. Preparing data for regression modeling
    6. 8.6. Mastering the vtreat package
      1. 8.6.1. The vtreat phases
      2. 8.6.2. Missing values
      3. 8.6.3. Indicator variables
      4. 8.6.4. Impact coding
      5. 8.6.5. The treatment plan
      6. 8.6.6. The cross-frame
    7. Summary
  19. 9 Unsupervised methods
    1. 9.1. Cluster analysis
      1. 9.1.1. Distances
      2. 9.1.2. Preparing the data
      3. 9.1.3. Hierarchical clustering with hclust
      4. 9.1.4. The k-means algorithm
      5. 9.1.5. Assigning new points to clusters
      6. 9.1.6. Clustering takeaways
    2. 9.2. Association rules
      1. 9.2.1. Overview of association rules
      2. 9.2.2. The example problem
      3. 9.2.3. Mining association rules with the arules package
      4. 9.2.4. Association rule takeaways
    3. Summary
  20. 10 Exploring advanced methods
    1. 10.1. Tree-based methods
      1. 10.1.1. A basic decision tree
      2. 10.1.2. Using bagging to improve prediction
      3. 10.1.3. Using random forests to further improve prediction
      4. 10.1.4. Gradient-boosted trees
      5. 10.1.5. Tree-based model takeaways
    2. 10.2. Using generalized additive models (GAMs) to learn non-monotone relationships
      1. 10.2.1. Understanding GAMs
      2. 10.2.2. A one-dimensional regression example
      3. 10.2.3. Extracting the non-linear relationships
      4. 10.2.4. Using GAM on actual data
      5. 10.2.5. Using GAM for logistic regression
      6. 10.2.6. GAM takeaways
    3. 10.3. Solving “inseparable” problems using support vector machines
      1. 10.3.1. Using an SVM to solve a problem
      2. 10.3.2. Understanding support vector machines
      3. 10.3.3. Understanding kernel functions
      4. 10.3.4. Support vector machine and kernel methods takeaways
    4. Summary
  21. Part 3. Working in the real world
  22. 11 Documentation and deployment
    1. 11.1. Predicting buzz
    2. 11.2. Using R markdown to produce milestone documentation
      1. 11.2.1. What is R markdown?
      2. 11.2.2. knitr technical details
      3. 11.2.3. Using knitr to document the Buzz data and produce the model
    3. 11.3. Using comments and version control for running documentation
      1. 11.3.1. Writing effective comments
      2. 11.3.2. Using version control to record history
      3. 11.3.3. Using version control to explore your project
      4. 11.3.4. Using version control to share work
    4. 11.4. Deploying models
      1. 11.4.1. Deploying demonstrations using Shiny
      2. 11.4.2. Deploying models as HTTP services
      3. 11.4.3. Deploying models by export
      4. 11.4.4. What to take away
    5. Summary
  23. 12 Producing effective presentations
    1. 12.1. Presenting your results to the project sponsor
      1. 12.1.1. Summarizing the project’s goals
      2. 12.1.2. Stating the project’s results
      3. 12.1.3. Filling in the details
      4. 12.1.4. Making recommendations and discussing future work
      5. 12.1.5. Project sponsor presentation takeaways
    2. 12.2. Presenting your model to end users
      1. 12.2.1. Summarizing the project goals
      2. 12.2.2. Showing how the model fits user workflow
      3. 12.2.3. Showing how to use the model
      4. 12.2.4. End user presentation takeaways
    3. 12.3. Presenting your work to other data scientists
      1. 12.3.1. Introducing the problem
      2. 12.3.2. Discussing related work
      3. 12.3.3. Discussing your approach
      4. 12.3.4. Discussing results and future work
      5. 12.3.5. Peer presentation takeaways
    4. Summary
  24. Appendix A. Starting with R and other tools
    1. A.1. Installing the tools
      1. A.1.1. Installing Tools
      2. A.1.2. The R package system
      3. A.1.3. Installing Git
      4. A.1.4. Installing RStudio
      5. A.1.5. R resources
    2. A.2. Starting with R
      1. A.2.1. Primary features of R
      2. A.2.2. Primary R data types
    3. A.3. Using databases with R
      1. A.3.1. Running database queries using a query generator
      2. A.3.2. How to think relationally about data
    4. A.4. The takeaway
  25. Appendix B. Important statistical concepts
    1. B.1. Distributions
      1. B.1.1. Normal distribution
      2. B.1.2. Summarizing R’s distribution naming conventions
      3. B.1.3. Lognormal distribution
      4. B.1.4. Binomial distribution
      5. B.1.5. More R tools for distributions
    2. B.2. Statistical theory
      1. B.2.1. Statistical philosophy
      2. B.2.2. A/B tests
      3. B.2.3. Power of tests
      4. B.2.4. Specialized statistical tests
    3. B.3. Examples of the statistical view of data
      1. B.3.1. Sampling bias
      2. B.3.2. Omitted variable bias
    4. B.4. The takeaway
  26. Appendix C. Bibliography
  27. Index
  28. List of Figures
  29. List of Tables
  30. List of Listings

Product information

  • Title: Practical Data Science with R, Second Edition
  • Author(s): Nina Zumel, John Mount
  • Release date: December 2019
  • Publisher(s): Manning Publications
  • ISBN: 9781617295874