Doing Data Science

Errata for Doing Data Science

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date Submitted Date Corrected
Printed
Page multiple
multiple

This errata was submitted by Philipp Marek via email.

Errata for Doing data science

I mark /deletions/, and *changes*.
This is in UTF8 -- so eg. a CRLF is shown as down-left pointing arrow: ↵

xvii: move 3 words: there is more breadth // than depth *in some cases*
xxi: Forgot to mention "Visual Display of Quantitative Information" ... although listed on p37
2: statis/i/tican
14: 1-4 use different shades of gray, or dashes or something like that
30: observed real-world phenomen*a* (or *a* phenomenon)
32: x in seconds? Don't integrate over minutes
38: http://stat.columbia.edu - everything else on github
43: hypo-thesis, not th-esis (?)
    2-3 Huma*n* behavi*or* (nouns)
    Trying to read associations fails; put Olympics beneath Olympic records?
44: an extension /of/ or variation of
48: an-swered?
49: Did Doug use ... (... "CPC") -- aren't used in text, no need to explain
50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice.
    "6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker."
    bk.homes[which ...] -- indentation of 3rd line wrong
    log() <= 5 ... better use <= 1e5 or 100e3 and remove log()
68: 3-6 truth = d*e*gree 2 (top right)
69: x*_2* * x_3
71: x₁�, not x�₁
72: you'd have establish*ed* the bins (or have *to* establish)
73: 3-7 doesn't include the points listed above
74: 3-8 use "x" for new guy, this point is already in 3-7
76: Hamming: shoe +s-s => hose, distance is 2
    we start with a Google search ... *which to use*.
77: n.points = length(data)
    Why not simply use a boolean vector of some length on data?
    swap lines: train <- and #define
78: swap cl <- and #
    swap true.labels and #
79: # We're using ... comment not helpful
85: http://abt.cm -- why a different link shortener?
    we showed how *to* explore and clean
87: remove line setwd()
90: U of Edinb*o*rough?
101: parallel/-/ly
108: WWW::Mechanize, and generally Perl for text extraction
111,112: script could use a few functions
117: *An Empirical...* format different from other book references or titles
129: "non discrete)" is still a comment, wrong format used
     c[, 2] - space before "," missing
131: vlist <- use less space to avoid line break, twice
132: "use holdout group" join to previous line
     "vars" within for loop?
137: prop/o/agates
140: 6-3 no colors visible. use distinguishable grays?
141: 6-4 no counts visible
147: what does 6-7 show?
151: 6-8 label both axes with text
155: 6-12 factors not distinguishable
156: this_E is unused
176: "Director of Research..." in one line
177: the modeling part isn't *what* we want
183: AIC Info*r*mation
184: a college studen ... spend *her* time
191: "column which is our response" is still a comment, has wrong format
194: "Google's Hybrid Approach" title => italic
201: simple but comp*l*ete
215: vr = indentation wrong
236: to a*c*cept
241: the Predicted=False row should have FN, TN
246: 2nd mouse/keyboard is not needed, other person should read and think, not type simultaneously
247: discrep*a*ncy
251: partic-ipate ?
254: digital media at␣Columbia (space missing)
281: "Overlapping..." title => italic
287: people that take/s/ some drug -- people take, not the population
293: "Oral..." title => italic
304: (hers is shown ... *)*
341: line 44 is hard to read, code doesn't match other formatting
349: Map*R*educe
351ff: Index:
       "Amazon Mechanical Turk" in Amazon
       bunch together "causal ..."
       bunch together "chaos ..."
       "Protocol buffers" instead of "prtobuf"
       and probably some more.

Note from the Author or Editor:
32: x in seconds? Don't integrate over minutes
    Cathy: change the "measured in seconds" to "measured in minutes" in the above paragraph.
50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice. "6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker." bk.homes[which ...] -- indentation of 3rd line wrong log() <= 5 ... better use <= 1e5 or 100e3 and remove log()
    Cathy: please indent after the "bk.homes" line as the above lines are indented. Otherwise fine to ignore these suggestions.
73: 3-7 doesn't include the points listed above
    Cathy: That's true. We might wanna change it to be more reasonable. I don't have the original data that was plotted here.
74: 3-8 use "x" for new guy, this point is already in 3-7
    Cathy: Can erase the "?" point in 3-7 for clarity.
76: Hamming: shoe +s-s => hose, distance is 2
    Cathy: this is false. Ignore.
77: n.points = length(data) Why not simply use a boolean vector of some length on data?
    Cathy: ignore
108: WWW::Mechanize, and generally Perl for text extraction
    Cathy: ignore.
111,112: script could use a few functions
    Cathy: please write your own book with a few functions.
141: 6-4 no counts visible
    Cathy: that's ok.
147: what does 6-7 show?
    Cathy: X-axis should be labeled "time in seconds"
246: 2nd mouse/keyboard is not needed, other person should read and think, not type simultaneously
    Cathy: ludicrous comment. Ignore. Also this is on page 245.

Rachel Schutt
O'Reilly Author 
Nov 20, 2013  Dec 13, 2013
PDF
Page multiple
multiple

Page | Error | Note
p.207 | star-up | should be "start-up"
p.359 | want achieve | should be "want to achieve"
p.162-163 | section headers are different sizes | "Exercise: GetGlue and Timestamped Event Data" and "Exercise: Financial Data" should be same size font
p.68 | dgree | In figure 3-6, should be "degree"
p.32-33 | inconsistent capitalization of random variables: x vs X |
p.21-22 | indentation is odd and seems arbitrary |
index | curse of dimensionality missing |
p.282 | "That experimental infrastructure" | strange phrasing

Rachel Schutt
O'Reilly Author 
Nov 20, 2013  Dec 13, 2013
PDF
Page xxi
2nd bullet

Introduction to Machine Learning (Adaptive Computation and Machine Learning) by Ethem Alpaydim (MIT Press). It is not Alpaydım, it's Alpaydın.

Tolga Bakkaloglu  Jan 01, 2014 
PDF
Page xx
Line 1-2 from the top

There is a typo in the surname "Vendenberghe". The correct spelling is Vandenberghe.

Zdzislaw Ploski  Aug 13, 2014 
PDF
Page 49
last line

To use the count() function in the line count(is.na(bk$SALE.PRICE.N)), you need to load the plyr package first with library(plyr) (see http://www.miskatonic.org/2012/09/24/counting-and-aggregating-r/). I'm using RStudio on a Mac, and the count() function doesn't work without first loading the plyr library.

Note from the Author or Editor:
please add library(plyr) before "require(gdata)" and after "# Author: Benjamin Reddy," on its own line.

Shafique Jamal  Jan 26, 2014 
Printed
Page 69
model formula

Please change:
    model <- lm(y ~ x_1 + x_2 + x_3 + x2_*x_3)
to
    model <- lm(y ~ x_1 + x_2 + x_3 + x_2:x_3)
or
    model <- lm(y ~ x_1 + x_2*x_3)

Note from the Author or Editor:
Happy to change this to model <- lm(y ~ x_1 + x_2*x_3) for simplicity, but I don't think it was wrong as is.

Matthias Kohl  Dec 14, 2013 
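
The two replacements proposed for page 69 describe the same model, because in R's formula syntax `a*b` is shorthand for `a + b + a:b` and a repeated term is only counted once. The sketch below is a toy Python re-implementation of just that expansion rule, not R's actual parser; the helper name `expand_terms` is hypothetical, and the variable names are taken from the erratum with the `x2_` typo already corrected to `x_2`.

```python
def expand_terms(rhs: str) -> list[str]:
    """Expand a tiny subset of R formula syntax.

    '+' separates terms, 'a*b' stands for a + b + a:b,
    and a term that appears more than once is kept only once.
    """
    terms: list[str] = []
    for chunk in rhs.split("+"):
        chunk = chunk.strip()
        if "*" in chunk:
            a, b = (part.strip() for part in chunk.split("*"))
            candidates = [a, b, f"{a}:{b}"]
        else:
            candidates = [chunk]
        for term in candidates:
            if term not in terms:
                terms.append(term)
    return terms


explicit = expand_terms("x_1 + x_2 + x_3 + x_2:x_3")   # first suggested fix
compact = expand_terms("x_1 + x_2*x_3")                # second suggested fix
redundant = expand_terms("x_1 + x_2 + x_3 + x_2*x_3")  # printed form, typo fixed

print(explicit)
print(explicit == compact == redundant)
```

Under this rule all three right-hand sides reduce to the same four terms, which is consistent with the author's remark that the printed formula was not wrong apart from the subscript typo.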
PDF
Page 88
3rd line of code

This line doesn't seem to put any info in the variable 'dup_add': dup_add <- mt_add[mt_add$dup,1] After entering this line, when I type 'dup_add' I get: character(0) And when I type 'dup_add[1]' or 'dup_add[2]' I get the following output: [1] NA I think that the formula for 'dup_add' given is wrong. Can you confirm? To get rid of the duplicates, here is what I did instead: dup_add2 <- mt_add[which(dup==TRUE),] mt_add2 <- mt_add[(mt_add$address.noapt != dup_add2[[1]][1] & mt_add$address.noapt != dup_add2[[1]][2]),] (This drops 4 observations (rows) instead of two - it drops BOTH copies of EACH duplicate. I'm a novice with R, so I still have to figure out how to drop only ONE copy of each duplicate, for a total of 2 rows dropped)

Note from the Author or Editor:
This line should read: dup_add <- mt_add[dup,1]

Shafique Jamal  Jan 28, 2014 
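
The page 88 fix replaces a lookup of a column named `dup` (which likely does not exist in `mt_add`, so the selection comes back empty) with indexing by the logical vector `dup` itself. The following Python sketch illustrates the same idea on made-up data, assuming `dup` follows the convention of R's `duplicated()`, where only the second and later copies are flagged. The addresses and variable names are hypothetical. It also shows one way to drop a single copy of each duplicate, which the submitter asked about.

```python
# Hypothetical address column; two addresses appear twice.
addresses = ["10 Main St", "22 Oak Ave", "10 Main St", "5 Elm Rd", "22 Oak Ave"]

# Logical vector: True where the address has already appeared earlier.
seen = set()
dup = []
for addr in addresses:
    dup.append(addr in seen)
    seen.add(addr)

# Analogue of mt_add[dup, 1]: select rows using the mask itself.
dup_add = [addr for addr, is_dup in zip(addresses, dup) if is_dup]

# Keep the first copy of every address (drops one row per duplicate).
deduped = [addr for addr, is_dup in zip(addresses, dup) if not is_dup]

# Drop every copy of a duplicated address (drops two rows per duplicate).
dropped_all = [addr for addr in addresses if addr not in dup_add]

print(dup_add)
print(deduped)
print(dropped_all)
```

Filtering on the negated mask keeps one row per address, whereas comparing against the list of duplicated addresses removes both copies, which matches the four dropped rows the submitter observed.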
PDF
Page 95
2d paragraph 1st sentence

"Thinking back to the previous chapter, in order to use liner regression,..." should be 'linear'

donald f caldwell  Dec 01, 2013  Dec 13, 2013
Printed
Page 103
1st para in section "Fancying..." and subsequent

The notation for the number of occurrences of the jth word in all emails would not be ambiguous if it were n_j ("n subscript j"), rather than n_c. Otherwise the ratios of counts used to compute probabilities theta_j and theta_k for two words, j and k, in spam would seem to have the same denominator. Thus, it is better to write p(word_j,spam) = theta_j = n_jc/n_j and p(word_k,spam) = theta_k = n_kc/n_k. n_c seems more likely to represent the number of spam emails.

Note from the Author or Editor:
This is terrible notation! Please change all the n_{j c} to n_{j s} and please also change all the n_{c} to n_{j c}. This is for the entire section called "Fancy It Up: Laplace Smoothing". So it should read "where n_js denotes the number of times that word appears in a spam email and n_jc denotes the number of times that word appears in any email".

leif wennerberg  May 25, 2014 
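
With the corrected notation for page 103, each word has its own denominator: theta_j is the ratio n_js / n_jc. A minimal Python sketch with invented counts is shown below. The smoothing constants `alpha` and `beta` and the additive form `(n_js + alpha) / (n_jc + beta)` are assumptions based on the usual presentation of Laplace smoothing, not a quotation from the book, and the function name `theta` is hypothetical.

```python
def theta(n_js: int, n_jc: int, alpha: float = 0.0, beta: float = 0.0) -> float:
    """Estimate theta_j = (n_js + alpha) / (n_jc + beta).

    n_js: number of times word j appears in spam emails
    n_jc: number of times word j appears in any email
    alpha, beta: assumed smoothing constants (0 means no smoothing)
    """
    return (n_js + alpha) / (n_jc + beta)


# Hypothetical counts per word: (n_js, n_jc). Note the different denominators.
word_counts = {"meeting": (2, 50), "prize": (30, 40)}

for word, (n_js, n_jc) in word_counts.items():
    print(word, theta(n_js, n_jc), theta(n_js, n_jc, alpha=1.0, beta=2.0))

# A word never seen in training has n_js = n_jc = 0, so only the
# smoothed estimate is defined for it.
print("unseen", theta(0, 0, alpha=1.0, beta=2.0))
```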
Printed
Page 103
para before last equation

Delete the equal sign and right side of the equation in the first sentence. It should read, "... values theta_j is the answer...". Otherwise the question following makes no sense: the lead in has answered it.

Note from the Author or Editor:
Agreed! That sentence should start: In other words, the vector of values θ_j is the answer...

leif wennerberg  May 25, 2014 
PDF
Page 108
lines 17-21

On page 108 a part of paragraph is repeated (four lines from "Represent each image..." to "...between 0 and 255").

Zdzisław Płoski  May 06, 2014
ePub
Page 119
United States

With respect to the erratum I just submitted, it appears the problem is my GitHub ignorance. Shift-clicking on the file doesn't have the obvious semantics, but the "download zipfile" button on the right side of the pane does. So my request is for a slight change to the text to make this clear for us CVS, SCCS, SVN, and BitKeeper folks who didn't get with Git.

Note from the Author or Editor:
Github Readme adjusted to indicate Download Zip button.

Keith Bierman  Oct 31, 2013  Dec 03, 2013
PDF
Page 119
Line 14 from the top

There is: "Recall that in Chapter 3". There should be "Recall that in Chapter 4".

Zdzislaw Ploski  Jul 31, 2014 
ePub
Page 121
United States

The equation after "In order to model the data, you need to work with a slightly more general function that expresses the relationship between the data and a probability of a click. Start by defining:" reads simply "z". The PDF is fine; only the ePub is affected. But this makes this part of the ePub incomprehensible.

Adam Merberg  Aug 29, 2014 
PDF
Page 153
Line 16 from the bottom

The expression -ln(2) begins with a hyphen. It should begin with a minus sign.

Note from the Author or Editor:
I don't know the difference between a hyphen and a minus sign! So let's go with it, what the heck.

Zdzislaw Ploski  Aug 01, 2014 
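
On the distinction raised for page 153: in Unicode the keyboard hyphen and the typographic minus sign are separate characters with different code points. The short Python sketch below, which is an illustration and not part of the book, prints both using only the standard library.

```python
import unicodedata

hyphen_minus = "-"       # U+002D, the character typed on a keyboard
minus_sign = "\u2212"    # U+2212, the typographic minus used in math

for ch in (hyphen_minus, minus_sign):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two characters look similar in print but do not compare equal.
print(hyphen_minus == minus_sign)
```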
PDF
Page 156
lines 10-9 from the bottom

There is: "your buying it has actually changed the process, through your market impact, and decreased the signal". Is it right? "Decreased"? Not "increased"?

Note from the Author or Editor:
This sentence should read: "But if you think about it, your buying it has actually changed the process, through your market impact, and decreased the signal you were anticipating, at least if the other market players bought it because it looked cheap to them at the previous price; you brought up the price a bit, so you might expect them to buy less in response, which means the overall signal is smaller."

Zdzislaw Ploski  May 24, 2014 
PDF
Page 160
Line 10 from the bottom

There is: "we solve for beta to get". Should be: "we solve for <Greek letter 'beta'> to get" as in several other places before.

Zdzislaw Ploski  May 25, 2014 
PDF
Page 161
Line 10 from the bottom (inside formula)

There is: ")/". Should be: ")".

Zdzislaw Ploski  May 25, 2014 
PDF
Page 162
line 16 from the bottom

In the sentence "Here's some R code to look at the first 10 rows in R", the words "in R" are redundant.

Note from the Author or Editor:
change to "Here's some R code to look at the first 10 rows"

Zdzislaw Ploski  May 19, 2014 
PDF
Page 173
Line 7 from the bottom

There is: ascii. Better notation here: ASCII.

Zdzislaw Ploski  Aug 03, 2014 
PDF
Page 191
Line 17 from the bottom (not counting an interline)

There is "fir" in place of a predicate in the comment. Does it mean "fire"?

Zdzislaw Ploski  Aug 03, 2014 
PDF
Page 200
16g

There is: "They were questions". There should be, I think: "There were questions".

Zdzislaw Ploski  Jun 03, 2014 
PDF
Page 201
Lines 9-10 from the top

In the sentence: "there are lines from a user to an item if that user has expressed an opinion about that item" words "are lines" should be replaced by "is a line" (cp. Fig. 8-1).

Zdzislaw Ploski  Jun 04, 2014 
PDF
Page 204
Lines 4 and 1 from the bottom

The order of indexes concerning three "f" (user attributes) is inverted (cp. their order three lines above). Is it correct?

Note from the Author or Editor:
Yes, the last line on page 204 should read p_i = \beta_1 f_{i,1} + \beta_2 f_{i,2} + \beta_3 f_{i,3}. Right now we see "f_{1,i}" instead of "f_{i,1}", for example.

Zdzislaw Płoski  Aug 04, 2014 
PDF
Page 205
17 from the top

There is: "the coefficients on one can be 100,000". There should (?) be: "the coefficient on one can be 100,000".

Zdzislaw Ploski  Jun 03, 2014
PDF
Page 209
lines 2-1 from the bottom

In the sentence "the age vectors of all the users will be a row in V", is the plural of "vector" correct? Why "age vector"? Isn't age a scalar value?

Note from the Author or Editor:
The parenthetical phrase at the bottom of the page should read: so the vector of ages of all the users will be a row in V

Zdzislaw Ploski  Jun 03, 2014 
PDF
Page 222
Line 6 from the top

There is: "a cool example of how ideally, data science integrates". Should be: "a cool example of how ideally data science integrates".

Zdzislaw Ploski  Jun 14, 2014 
PDF
Page 229
Lines 5, 10 and 11 from the top

In line 5 there is "bit.ly". In lines 10 and 11: "bitly". Suggestion: use uniform notation everywhere.

Zdzislaw Ploski  Jun 14, 2014 
PDF
Page 245
line 8 from the bottom

There is: "using git. Learn about git". Better: "using Git. Learn about Git".

Zdzislaw Ploski  Jun 09, 2014 
PDF
Page 269
Lines 19-20 from the top

There is a typo in the surname "Kolazcyk". The correct spelling is Kolaczyk.

Zdzislaw Ploski  Jun 21, 2014 
PDF
Page 277
Line 6 from the bottom

There is: "say on". Should be: "say in" (cp. appropriate site).

Zdzislaw Ploski  Jun 24, 2014 
Printed
Page 287
last line

The causal effect is 10 percentage points, not 10%.

Note from the Author or Editor:
Correct, it should read "10 percentage points."

Stephanie Eckman  Sep 11, 2014 
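
The page 287 correction turns on the difference between an absolute change in a rate (percentage points) and a relative change (percent). The Python sketch below uses made-up rates of 20% and 30%, which are assumptions for illustration and not figures from the book.

```python
# Hypothetical outcome rates, expressed in percent.
baseline_pct = 20   # rate without the treatment
treated_pct = 30    # rate with the treatment

# Absolute difference between the two rates, in percentage points.
point_difference = treated_pct - baseline_pct

# Relative change of the treated rate against the baseline, in percent.
relative_change = 100 * (treated_pct - baseline_pct) / baseline_pct

print(f"{point_difference} percentage points")
print(f"{relative_change:.0f}% relative increase")
```

With these numbers the same effect is 10 percentage points but a 50% relative increase, which is why calling it "10%" is ambiguous.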
PDF
Page 296
lines 5-7 from the top

The quotation "After adjustment for length of use, users of oral contraceptives were at least twice the risk of clotting compared with users of other kinds of oral contraceptives" lacks at least the phrase "with desogestrel, gestodene, or drospirenone"; without it the quoted sentence is not clear. The end of the quoted sentence has also been changed (shortened) without any remark.

Note from the Author or Editor:
The quote in question should be adjusted to read: After adjustment for length of use, users of oral contraceptives with desogestrel, gestodene, or drospirenone were at least at twice the risk of venous thromboembolism compared with users of oral contraceptives with levonorgestrel.

Zdzislaw Ploski  Jun 27, 2014 
PDF
Page 298
Lines 1-2 from the top

The sentence "The kinds of decisions they tweaked were of the following types" reads poorly because of the "kinds ... of the ... types" construction. Perhaps "The kinds of decisions they tweaked were as follows" would be better.

Zdzislaw Ploski  Jun 27, 2014 
PDF
Page 301
Line 4 from the bottom

There is a word "medicare" (starting from a lower case "m"). Is it about Medicare (cp. www.medicare.gov)?

Zdzislaw Ploski  Jun 28, 2014 
PDF
Page 313
Line 16 from the top

Is the word "clean" mandatory in the sentence "The best practice is to start from scratch with clean, raw data"? Isn't "clean data" an antithesis of "raw data" in the context of the book?

Note from the Author or Editor:
Please replace the word "clean" with the word "unfiltered".

Zdzislaw Ploski  Aug 10, 2014 
PDF
Page 315
Lines 6-7 from the top

In the sentence "if the vast majority is of binary outcomes are 1" is the word "is" mandatory?

Note from the Author or Editor:
delete "is" from sentence

Zdzislaw Ploski  Jun 30, 2014 
PDF
Page 319
Lines 15-16 from the top

In the sentence "You'd like to save money and only send money to people who are likely to give", the second word "money" should be replaced with "letter".

Note from the Author or Editor:
change sentence to "...and only send a letter to people..."

Zdzislaw Ploski  Jun 30, 2014 
PDF
Page 323
Paragraph that starts

Please footnote the end of that first sentence as follows: By some estimates, one or two patients died per week in a certain smallish town because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic \footnote{Andrew Gelman thinks this parable is unlikely, and he wrote up a response which you can read here: http://andrewgelman.com/2014/01/24/parables-vs-data/.}.

Cathy O'Neil
O'Reilly Author 
Sep 25, 2014 
PDF
Page 329
Line 17 from the bottom

What does "to shave off nanoseconds 10^-9" mean? That a nanosecond equals 10^-9 of a second? (It does.) 10^-9 of [one] nanosecond? Something else?

Note from the Author or Editor:
This sentence should read: Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds from repetitive processes because you're dealing with petabytes of data.

Zdzislaw Ploski  Jul 03, 2014 
PDF
Page 330
Lines 2-4 from the top

There is needless redundancy in the sentence: "a record with a person living in zip code 90210 who clicked on an ad would get emitted to (90210,{1,1}) if that person saw an ad and clicked, or (90210,{0,1}) if they saw an ad and didn't click." It says twice that the person clicked on an ad.

Note from the Author or Editor:
Change sentence to: "You could run MapReduce keyed by zip code so that a record with a person living in zip code 90210 would get emitted to (90210,{1,1}) if that person saw an ad and clicked, or (90210,{0,1}) if they saw an ad and didn't click."

Zdzislaw Ploski  Jul 03, 2014 
PDF
Page 330
Line 13 from the top

Is the expression ((90210], user_5321} <- {1,1} correct? What about the correctness of the parentheses?

Note from the Author or Editor:
That expression should be rewritten as: ({90210,user_5321}, {1, 1})

Zdzislaw Ploski  Aug 11, 2014 
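
The two page 330 corrections describe emitted key-value pairs of the form (key, {clicks, impressions}), keyed either by zip code or by zip code and user. The Python sketch below imitates that in a single process. The sample records, the function names, and the choice to have the reduce step sum the pairs are all assumptions for illustration, and nothing here uses a real MapReduce framework.

```python
from collections import defaultdict

# Hypothetical impression log: (zip_code, user_id, clicked)
records = [
    ("90210", "user_5321", True),
    ("90210", "user_5321", False),
    ("90210", "user_17", False),
    ("10027", "user_42", True),
]


def map_by_zip(record):
    """Emit (zip, (1, 1)) for a click and (zip, (0, 1)) for a non-click."""
    zip_code, _user, clicked = record
    return zip_code, (int(clicked), 1)


def map_by_zip_and_user(record):
    """Emit ((zip, user), (clicked, 1)), the composite-key variant."""
    zip_code, user, clicked = record
    return (zip_code, user), (int(clicked), 1)


def run_mapreduce(records, mapper):
    """Group the emitted pairs by key and sum clicks and impressions."""
    totals = defaultdict(lambda: (0, 0))
    for record in records:
        key, (clicks, impressions) = mapper(record)
        total_clicks, total_impressions = totals[key]
        totals[key] = (total_clicks + clicks, total_impressions + impressions)
    return dict(totals)


by_zip = run_mapreduce(records, map_by_zip)
by_zip_user = run_mapreduce(records, map_by_zip_and_user)

print(by_zip)
print(by_zip_user)
```

Dividing the first element of each summed pair by the second would give a click-through rate per key.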
PDF
Page 334
Lines 16-15 from the bottom

Something is missing from the sentence: "Writing MapReduce in the Java API not pleasant". Is the predicate missing?

Note from the Author or Editor:
"Writing MapReduce in the Java API is not pleasant."

Zdzislaw Ploski  Jul 05, 2014 
PDF
Page 335
Lines 12-11 from the bottom

There is: "Github". Should be: "GitHub".

Zdzislaw Ploski  Jul 04, 2014 
PDF
Page 341
Line 10 from the top

There is: "git". Should be: "Git".

Zdzislaw Ploski  Jul 07, 2014 
PDF
Page 344
Line 19 from the top

There is: "In addition". Should be: ". In addition".

Note from the Author or Editor:
add period between equation and "In addition" as indicated

Zdzislaw Ploski  Jul 07, 2014