Errata

Errata for Data Science from Scratch

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version	Location	Description	Submitted by	Date submitted
Other Digital Version	loc 2048 The Central Limit Theorem (chapter 6)	In the Kindle book, bernoulli binomial is incorrectly defined as def binomial(n,p): return sum(bernoulli_trial(p) for _ in range(n)) .... while in the code repo, it is correct: def binomial(p, n): return sum(bernoulli_trial(p) for _ in range(n)) the erroneous transposition is confusing, as later an example is given: make_hist(0.75, 100, 10000) where make_hist(p,n,num_points)	Pablo Rodriguez Bertorello	Jun 29, 2017
PDF	Page 10 6th paragraph	It seems that "65%" is not correct in the following sentence and should be changed to "79166.67/(61500+48000)=72.3%": "Data scientists with more than five years experience earn 65% more than data scientists with little or no experience!"	A. R. Nematollahi	Sep 29, 2023
Printed	Page 4 Second code block populating users list with friendship data	The current text is: for i, j in friendships: # this works because users[i] is the user whose id is i users[i]["friends"].append(users[j]) #add i as a friend of j users[j]["friends"].append(users[i]) #add j as a friend of i There are two issues: (1) the comments are reversed between the two code lines (already reported and listed as confirmed error). (2) the correct code for the two statements inside the for loop should actually be: users[i]["friends"].append(j) #add j as a friend of i users[j]["friends"].append(i) #add i as a friend of j	Anonymous	Sep 25, 2017
Printed, ePub	Page 5 3rd set of code text	The text in the ebook and printed book says: sorted(num_friend_by_id, key=lambda... but lambda is deprecated and does not work. The specific error message is: "tuple unpacking is not supported in Python 3"	Russ Conte	Jul 12, 2017
Printed	Page 5 Code block above figure	sorted(num_friends_by_id, key=lambda (user_id, num_friends): num_friends, reverse=True) # code provided in book (above) does not work in Python3 due to invalid syntax # this works sorted(num_friends_by_id, key=lambda num_friends: num_friends[1], reverse=True)	Anastasia Gkelameri	Jan 13, 2019
PDF	Page 5 12th total line of the page. Inside sorted(), 2nd line.	sorted(num_friends_by_id, key=lambda (user_id, num_friends): num_friends, reverse=True) The same call in my Spyder (Python 3.7) reports that lambda is missing 1 required positional argument. I had to sort the list using another key. I just want to know whether this is due to the Python version (the book says it is built on Python 2.7) or something else. Note: no value named "num_friends" was previously assigned in any other example. This may be useful.	Raul Dias Barboza	Jul 21, 2019
Printed, ePub	Page 6 last line of code	The line of code in the printed and PDF versions reads: print friends_of_friend_ids(users[3]) The argument should be wrapped in parentheses, as follows: print(friends_of_friend_ids(users[3])) Note this is correct on the GitHub page: https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/introduction.py	Russ Conte	Jul 11, 2017
Printed, ePub	Page 6 Middle of the page	Both the printed and ePub versions have three lines in the middle of the page that start: print [friend["id"] for friend in users[0]["friends"]] The other two print lines are analogous. The print command is missing parentheses and does not run on my system (Python 3.6, up to date). Adding parentheses allows the lines to run correctly: print([friend["id"] for friend in users[0]["friends"]])	Russ Conte	Jul 12, 2017
Printed, ePub	Page 10 Just below the middle of the page	The line in question is: for tenure_bucket, salaries in salary_by_tenure_bucket.iteritems() That generates an error message (Python 3.6.0, PyCharm 2016.3.3): AttributeError: 'collections.defaultdict' object has no attribute 'iteritems' A line that runs is: for tenure_bucket, salaries in salary_by_tenure_bucket.items()	Russ Conte	Jul 12, 2017
Printed, ePub	Page 17 Middle of the page	for the line import re as regex Python reports that regex is an alternative regular expression module, to replace re. In other words, re is now out of date. See here for more details: https://bitbucket.org/mrabarnett/mrab-regex	Russ Conte	Jul 13, 2017
Printed	Page 26 first code snippet	The text claims that this bit of code: s = some_function_that_returns_a_string() if s: first_char = s[0] else: first_char = "" is equivalent, due to truthiness, to this: first_char = s and s[0] This is not accurate. As an example, assume s = None. The result of the above if statement will be first_char equal to "". The result of first_char = s and s[0] will be first_char equal to None.	Adrian	Nov 17, 2017
Printed	Page 34 last paragraph	The link ipython.org/videos.html is no longer valid. Perhaps ipython.org/presentation.html can be used as an alternative.	Adrian	Nov 22, 2017
PDF	Page 39 Second last line of code on the page.	The code says: "# label x-axis with movie names at bar centers plt.xticks( [ i + 0.5 for i, _ in enumerate(movies) ], movies) plt.show()" The 0.5 in plt.xticks should be replaced with the value 0.1 so that the movie names are at the bar centers. Thus the code should be: plt.xticks( [ i + 0.1 for i, _ in enumerate(movies) ], movies)	Gavan Corke	Feb 23, 2017
Printed	Page 51 6th block of example code	def vector_mean(vectors): """compute the vector whose ith element is the mean of the ith elements of the input vectors""" n = len(vectors) return scalar_multiply(1/n, vector_sum(vectors)) When you run the vector_mean function the result is always a vector full of zeros unless the list of vectors passed into the function only contains one vector. The scalar_multiply function has 1/n passed into it, but this is rounded to 0 when dividing 1 by any integer greater than 1. This is corrected by changing the 4th line of code to: n = float(len(vectors))	Jeff Wallace	Nov 15, 2017
PDF	Page 84 2nd paragraph (below first code block)	Rejection range is incorrect. "... rejects H0 when X is between 526 and 531 ..." It should be "... rejects H0 when X is larger than 526 ..."	Anonymous	Mar 14, 2017
PDF	Page 99 last paragraph	The page contains the following sentence: "And changing one of our data points by a small amount e might increase the median by e, by some number less than e, or not at all (depending on the rest of the data)." I'm confused. Can changing a value change the median by e? I think the median does not change unless the change alters which values fall before and after the median in the sorted data.	Sina Saeednia	Apr 22, 2021
Printed	Page 100 last code block before the "return min_theta" statement	In the line of code: # and take a gradient step for each of the data points for x_i, y_i in in_random_order(data): gradient_i = gradient_fn(x_i, y_i, theta) theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i)) I think you meant to take the gradient on only a subset of "data". Otherwise, by looping over the entire dataset you are taking a gradient step which includes all of the data.	Eder Izaguirre	Mar 03, 2017
Other Digital Version	147 2nd to last	"model on page 142" --> model is actually on pg 143	Patrick	Jul 26, 2017
PDF	Page 219 backpropagate function definition	Dear Joel, sorry to bother you but I have a question regarding the computation of the output_deltas. It appears in the code that if the output = 1, the term [ output * (1 - output) * (output - target) ] = 0 whatever is the target value. So, I do not understand this part because the output could be 1 but not the correct value, which is expected to be equal to target value. Is it something wrong in my brain or in the code ? :) Thanks Best regards Jerome	Jerome_Massot	Jul 05, 2017
PDF	Page 219 'backpropagate' function	Two things. First the expression for output_deltas is wrong. There is actually no factor of output*(1 - output), this only comes in when considering hidden layers due to the chain rule. Although we do differentiate the sigmoid once, the result simplifies from the two terms in the definition of the logistic cost function. Secondly, it is wrong to update the weights going into the final layer before the errors for preceding layers have been calculated. As the calculation for the errors depends on the weight, we end up with wrong values for the hidden errors, and hence do not update the weights going into the hidden layer correctly. Correction of both of these yields significant improvement in performance.	Sam Vs	Feb 10, 2018
Printed	Page 284 4th paragraph	The SQL query has two errors: 1. user.id in SELECT should be users.user_id. 2. The following GROUP BY statement should be added at the end of the query: GROUP BY users.user_id The complete query should be as follows: SELECT users.user_id, COUNT(user_interests.interest) AS num_interests FROM users LEFT JOIN user_interests ON users.user_id = user_interests.user_id GROUP BY users.user_id	Sergiy Kolesnikov	Jan 10, 2018
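
Several of the reports above (pages 5, 6, 10 and 51) appear to stem from running the book's Python 2 code under Python 3. The following is a minimal sketch of what the Python 3 forms could look like; it is not the author's confirmed correction. The sample data and the simplified vector_mean (which does not use the book's scalar_multiply and vector_sum helpers) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical sample data, not the book's actual values.
num_friends_by_id = [(0, 2), (1, 3), (2, 3), (3, 1)]

# Page 5: Python 3 removed tuple unpacking in lambda parameters,
# so index into the (user_id, num_friends) pair instead.
by_friend_count = sorted(num_friends_by_id,
                         key=lambda pair: pair[1],
                         reverse=True)

# Page 6: print is a function in Python 3, so the argument
# must be wrapped in parentheses.
print(by_friend_count)

# Page 10: dict.iteritems() no longer exists in Python 3; use items().
salary_by_tenure_bucket = defaultdict(list)  # hypothetical contents
salary_by_tenure_bucket["less than two"].append(48000)
salary_by_tenure_bucket["less than two"].append(50000)
average_salary_by_bucket = {
    bucket: sum(salaries) / len(salaries)
    for bucket, salaries in salary_by_tenure_bucket.items()
}

# Page 51: in Python 3, "/" is true division, so dividing by an
# integer n no longer truncates to 0 as it did in Python 2.
def vector_mean(vectors):
    """Simplified stand-in: element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

mean = vector_mean([[1, 2], [3, 4]])
print(mean)
```

Under Python 2, the float(len(vectors)) change suggested in the page 51 report (or "from __future__ import division") would be needed to avoid the truncation.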