Doing Data Science

Errata for Doing Data Science



The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.





Version Location Description Submitted By Date Submitted
Printed Page 38-39

Hello. In the data file for the first exercise, all of the records with SIGNED_IN==FALSE have the value 0 for both the Age and Gender fields. These would seem to be meaningless default values, and in the sample code there is a line to produce a summary of the data that ignores them: summaryBy(Gender+Signed_In+Impressions+Clicks~agecat, data=data1) But what the sample code doesn't seem to do is to replace these placeholder values with either nulls or with values that are less likely to be mistaken for good data (e.g., -99). It seems confusing to have a column where 0 means female, unless some other column is also 0, in which case it means unknown. My instinct would be to clean that up, but the fact that you don't seem to suggest doing so here makes me wonder if there is a reason not to take this approach. Thanks.

George Schneiderman  Jun 17, 2016 
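
One way to act on the cleanup suggested above is to mark the placeholder values as missing before summarizing. The following is a minimal sketch on toy data, not the book's code; it assumes the column names Age, Gender and Signed_In, and assumes Signed_In is coded 0/1.

```r
# Toy data standing in for the exercise file (values are made up for illustration).
data1 <- data.frame(
  Age       = c(0, 36, 0, 17, 73),
  Gender    = c(0, 0, 0, 1, 1),
  Signed_In = c(0, 1, 0, 1, 1)   # assumption: 0 = not signed in, 1 = signed in
)

# For users who were not signed in, Age and Gender are placeholders,
# so replace them with NA instead of leaving a misleading 0.
not_signed_in <- data1$Signed_In == 0
data1$Age[not_signed_in]    <- NA
data1$Gender[not_signed_in] <- NA

data1
```

After this step a Gender of 0 appears only for signed-in users, and functions such as mean() need na.rm = TRUE to skip the missing values.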
Printed Page 39
In the code

On page 39 there is code that is supposed to cut users as "<18", "18-24", ... etc. The code given is: data1$agecat<- cut(data1$Age,c(-Inf,0,18,24,34,44,54,64,Inf)) The intervals created from this code are: (-Inf, 0], (0, 18], (18,24], ... etc. The problem is that 18 is included in the under 18 group using this code. Also, 0 is separated from the other users who are under 18.

Jason Scott  Aug 13, 2015 
Printed Page 39
Near top of code segment

On p. 38 the task is to separate users by age into <18, 18-24, 25-34, 35-44, 45-54, 55-64 and 65+. The sample code creates field agecat with this line: data1$agecat<-cut(data1$Age,c(-Inf,0,18,24,34,44,54,64,Inf)) But this sorts users into categories < 19, 19-24, 25-34, etc. The code should be data1$agecat<-cut(data1$Age,c(-Inf,0,19,24,34,44,54,64,Inf))

JD Baldwin  Sep 23, 2017 
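
The boundary behaviour described in the two reports above can be checked on a few ages. This is a sketch, not the book's code: cut() uses right-closed intervals by default, so a break at 18 puts 18 in the youngest group, and a break at 19 would likewise put 19 there. One alternative, shown here as an assumption about the intended grouping, is to use left-closed intervals with a break at the lower bound of each age band. It also assumes the placeholder zeros have been handled separately, since this version places age 0 in the "<18" group.

```r
ages <- c(0, 5, 17, 18, 19, 24, 25, 64, 65)

# Breaks as printed in the book: intervals are closed on the right,
# so 18 falls into (0,18] rather than into an 18-24 group.
book <- cut(ages, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))

# Alternative sketch: left-closed intervals [18,25), [25,35), ...
fixed <- cut(ages,
             breaks = c(-Inf, 18, 25, 35, 45, 55, 65, Inf),
             right  = FALSE,
             labels = c("<18", "18-24", "25-34", "35-44",
                        "45-54", "55-64", "65+"))

data.frame(ages, book, fixed)
```

Printing the data frame side by side makes it easy to see which group each boundary age lands in under either set of breaks.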
Printed Page 48
numbered paragraph 2, first sentence

The parenthetical reads, "typical in a startup when its still building its product". That first "its" should be "it's".

George Schneiderman  Jun 17, 2016 
Printed Page 50
4th statement from the bottom

The statement

bk.homes[which(bk.homes$sale.price.n<100000),]
[order(bk.homes[which(bk.homes$sale.price.n<100000),]
$sale.price.n),]

throws the following error:

21769 1440 1250
21785 2740 1962
22760 2300 1283
23098 2080 2003
23117 2550 1800
> [order(bk.homes[which(bk.homes$sale.price.n<100000),]
Error: unexpected '[' in " ["

Mahboob Hussain  Sep 19, 2015 
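
The output above suggests that R evaluated the first line as a complete statement and then failed on the next line, which begins with `[`. A minimal sketch of one workaround is to store the subset first and then sort it. The data frame below is a made-up stand-in for bk.homes, and the names cheap and cheap.sorted are hypothetical.

```r
# Toy stand-in for bk.homes; only the column used in the report is included.
bk.homes <- data.frame(sale.price.n = c(250000, 1250, 99000, 1440, 500000))

# Take the subset once, then sort that subset by sale price.
# drop = FALSE keeps the result a data frame even with a single column.
cheap <- bk.homes[which(bk.homes$sale.price.n < 100000), , drop = FALSE]
cheap.sorted <- cheap[order(cheap$sale.price.n), , drop = FALSE]

cheap.sorted
```

Keeping each statement on a line that cannot be parsed as complete until its end (or using an intermediate variable, as here) avoids the stray `[` at the start of a line.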
Printed Page 86
Last line

> require(geoPlot)
Loading required package: geoPlot
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
  there is no package called ‘geoPlot’

Mahboob Hussain  Sep 23, 2015 
Printed Page 87
12th line from bottom

> mt$address.noapt <- gsub("[,][[:print:]]*", "",
+ gsub(("[ ]+", " ", trim(mt$address)))
Error: unexpected ',' in:
"mt$address.noapt <- gsub("[,][[:print:]]*", "",
gsub(("[ ]+","

Mahboob Hussain  Sep 23, 2015 
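
The error above is consistent with the doubled opening parenthesis in `gsub((`: R reads `("[ ]+", ...` as a parenthesized expression and rejects the comma inside it. A sketch with a single parenthesis follows. The addresses are made up, and base R's trimws() is used here in place of the trim() call in the reported code, which is an assumption about what that helper does.

```r
# Made-up addresses for illustration.
mt <- data.frame(address = c("  123   MAIN ST, APT 4B ", "45 BROADWAY"),
                 stringsAsFactors = FALSE)

# Inner gsub collapses runs of spaces into one space;
# outer gsub removes the first comma and everything printable after it.
mt$address.noapt <- gsub("[,][[:print:]]*", "",
                         gsub("[ ]+", " ", trimws(mt$address)))

mt$address.noapt
```

An address without a comma passes through unchanged apart from the whitespace cleanup.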
PDF Page 111
Sample R Code for Dealing with the NYT API

The New York Times API has changed, so nearly everything in the code sample must be reworked. For example, res1$results becomes res1$response$docs.

Steven  Dec 08, 2015 
Printed Page 126
3rd line

Broken link. The article is no longer available for free.

Buno Betoni Parodi  Jan 06, 2018 
Printed Page 162
Code snippet

datapath <- "http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz" Does this path still exist? If not, can the dataset be included at the following URL? https://github.com/oreillymedia/doing_data_science

Anonymous  Jun 22, 2017