Errata for Data Science from Scratch

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
Other Digital Version loc 2048
The Central Limit Theorem (chapter 6)

In the Kindle book, the binomial function is incorrectly defined as

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))

while in the code repo, it is correct:

def binomial(p, n):
    return sum(bernoulli_trial(p) for _ in range(n))

The erroneous transposition is confusing, as a later example calls make_hist(0.75, 100, 10000), where the signature is make_hist(p, n, num_points).
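
A minimal sketch of the repo's argument order follows. The body of bernoulli_trial is not quoted in this report, so the version here (return 1 with probability p, otherwise 0) is an assumption.

```python
import random

def bernoulli_trial(p):
    """Assumed helper: return 1 with probability p, otherwise 0."""
    return 1 if random.random() < p else 0

def binomial(p, n):
    """Sum of n Bernoulli(p) trials; p comes first, as in the code repo."""
    return sum(bernoulli_trial(p) for _ in range(n))

random.seed(0)  # fixed seed only so repeated runs give the same draw
draw = binomial(0.75, 100)  # same (p, n) order as make_hist(0.75, 100, 10000)
print(draw)
```

With the (n, p) order and the same call, Python 3 would evaluate range(0.75) and raise a TypeError, which is one way the transposition shows up in practice.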

Pablo Rodriguez Bertorello  Jun 29, 2017 
PDF Page P. 10
6th paragraph

It seems that "65%" is not correct in the following sentence and must be changed to "79166.67/(61500+48000)=72.3%"!

Data scientists with more than five years experience
earn 65% more than data scientists with little or no experience!
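
The two figures can be compared with a short calculation. This is only a sketch: the three averages are the numbers quoted in this report, and the bucket each one belongs to is an assumption made for illustration.

```python
# Averages quoted in this report; the bucket labels are assumptions.
low_tenure_avg = 48000.00    # assumed: little or no experience
mid_tenure_avg = 61500.00    # assumed: middle bucket
high_tenure_avg = 79166.67   # assumed: more than five years of experience

# "X% more than" compares the high bucket against the low bucket alone.
percent_more = (high_tenure_avg / low_tenure_avg - 1) * 100

# The ratio proposed in the report divides by the sum of the other two averages.
proposed_ratio = high_tenure_avg / (mid_tenure_avg + low_tenure_avg) * 100

print(round(percent_more, 1))    # 64.9
print(round(proposed_ratio, 1))  # 72.3
```

Under these assumptions, the first value is the usual reading of "earn X% more than", while the second is the high average as a percentage of the other two averages added together.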

A. R. Nematollahi  Sep 29, 2023 
Printed Page 4
Second code block populating users list with friendship data

The current text is:

for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j]) #add i as a friend of j
    users[j]["friends"].append(users[i]) #add j as a friend of i

There are two issues:

(1) the comments are reversed between the two code lines (already reported and listed as confirmed error).

(2) the correct code for the two statements inside the for loop should actually be:
users[i]["friends"].append(j) #add j as a friend of i
users[j]["friends"].append(i) #add i as a friend of j
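
A minimal sketch of the id-based version proposed in (2), using a small hypothetical data set (these users and friendships are illustrative, not the book's full lists):

```python
# Hypothetical sample data for illustration only.
users = [
    {"id": 0, "name": "A"},
    {"id": 1, "name": "B"},
    {"id": 2, "name": "C"},
]
friendships = [(0, 1), (0, 2), (1, 2)]

# Give every user an empty friends list first.
for user in users:
    user["friends"] = []

for i, j in friendships:
    # relies on users[i] being the user whose id is i
    users[i]["friends"].append(j)  # add j as a friend of i
    users[j]["friends"].append(i)  # add i as a friend of j

print(users[0]["friends"])  # [1, 2]
```

Storing ids keeps each friends list free of nested user dicts, but any later code that reads friend["id"] from a friends list would need to be adjusted to use the id directly.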


Anonymous  Sep 25, 2017 
Printed, ePub Page 5
3rd set of code text

The text in the ebook and printed book says:

sorted(num_friends_by_id, key=lambda...

but this lambda form takes a tuple parameter, which Python 3 no longer supports, so it does not work. The specific error message is:

"tuple unpacking is not supported in Python 3"

Russ Conte  Jul 12, 2017 
Printed Page 5
Code block above figure


sorted(num_friends_by_id,
       key=lambda (user_id, num_friends): num_friends,
       reverse=True)
# code provided in book (above) does not work in Python3 due to invalid syntax

# this works
sorted(num_friends_by_id,
       key=lambda num_friends: num_friends[1],
       reverse=True)
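
Two Python 3 spellings of the same sort are sketched below. The (user_id, num_friends) pairs are hypothetical sample data, and the parameter name pair is an illustrative choice.

```python
from operator import itemgetter

# Hypothetical (user_id, num_friends) pairs.
num_friends_by_id = [(0, 2), (1, 3), (2, 3), (3, 1)]

# Python 3 lambdas cannot unpack a tuple parameter, so take one
# argument and index into it instead.
by_lambda = sorted(num_friends_by_id, key=lambda pair: pair[1], reverse=True)

# operator.itemgetter(1) builds the same key function.
by_itemgetter = sorted(num_friends_by_id, key=itemgetter(1), reverse=True)

print(by_lambda)  # [(1, 3), (2, 3), (0, 2), (3, 1)]
```

Both calls sort by the second element of each pair, from most friends to fewest.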

Anastasia Gkelameri  Jan 13, 2019 
PDF Page 5
12th total line of the page. Inside sorted(), 2nd line.

sorted(num_friends_by_id, key=lambda (user_id, num_friends): num_friends, reverse=True)

The same call in my Spyder (Python 3.7) reports that lambda is missing 1 required positional argument.

I had to sort the list using another key. I just want to know whether this is due to the Python version (the book says it is built on Python 2.7) or something else.

Note: no value named "num_friends" was previously assigned in any other example. This may be useful.

Raul Dias Barboza  Jul 21, 2019 
Printed, ePub Page 6
last line of code

The line of code in the printed and PDF versions reads:

print friends_of_friend_ids(users[3])

In Python 3 the argument should be wrapped in parentheses, as follows:

print(friends_of_friend_ids(users[3]))

Note that this is correct on the GitHub page:

https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/introduction.py

Russ Conte  Jul 11, 2017 
Printed, ePub Page 6
Middle of the page

Both the printed and ePub version have three lines in the middle of the page that start:

print [friend["id"] for friend in users[0]["friends"]]

The other two print lines are analogous.

The print command is missing parentheses and does not run on my system (Python 3.6, up to date). Adding parentheses allows the lines to run correctly:

print([friend["id"] for friend in users[0]["friends"]])

Russ Conte  Jul 12, 2017 
Printed, ePub Page 10
Just below the middle of the page

The line in question is:

for tenure_bucket, salaries in salary_by_tenure_bucket.iteritems()

That generates an error message (Python 3.6.0, PyCharm 2016.3.3):

AttributeError: 'collections.defaultdict' object has no attribute 'iteritems'

A line that runs is:

for tenure_bucket, salaries in salary_by_tenure_bucket.items()
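
A minimal sketch of the Python 3 form in context. The bucket names, the salary figures, and the name average_salary_by_bucket are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Hypothetical bucketed salaries.
salary_by_tenure_bucket = defaultdict(list)
salary_by_tenure_bucket["less than two"].extend([48000, 50000])
salary_by_tenure_bucket["more than five"].extend([83000, 88000, 76000])

# .items() replaces the Python 2 .iteritems() call.
average_salary_by_bucket = {
    tenure_bucket: sum(salaries) / len(salaries)
    for tenure_bucket, salaries in salary_by_tenure_bucket.items()
}

print(average_salary_by_bucket["less than two"])  # 49000.0
```

In Python 3, items() returns a view rather than a full list, so it serves the same purpose that iteritems() did in Python 2.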

Russ Conte  Jul 12, 2017 
Printed, ePub Page 17
Middle of the page

For the line

import re as regex

Python reports that regex is an alternative regular expression module, to replace re. In other words, re is now out of date.

See here for more details:

https://bitbucket.org/mrabarnett/mrab-regex

Russ Conte  Jul 13, 2017 
Printed Page 26
first code snippet

The text claims that this bit of code:

s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""

is equivalent, due to truthiness, to this:

first_char = s and s[0]

This is not accurate.

As an example, assume s = None.
With the if statement above, first_char will be "".
With first_char = s and s[0], first_char will be None.
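
The difference can be checked with a short sketch. The two wrapper function names are hypothetical and exist only to put the forms side by side.

```python
def first_char_if_else(s):
    """The if/else form quoted above."""
    if s:
        return s[0]
    return ""

def first_char_and(s):
    """The `s and s[0]` form quoted above."""
    return s and s[0]

# Compare both forms on a non-empty string, an empty string, and None.
for s in ["abc", "", None]:
    print(repr(s), repr(first_char_if_else(s)), repr(first_char_and(s)))
```

The two forms agree for any string, including the empty string, and differ only when s is a falsy non-string such as None. A conditional expression such as first_char = s[0] if s else "" keeps the if/else behavior on one line.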

Adrian  Nov 17, 2017 
Printed Page 34
last paragraph

The link ipython.org/videos.html is no longer valid.

Perhaps ipython.org/presentation.html can be used as an alternative.

Adrian  Nov 22, 2017 
PDF Page 39
Second last line of code on the page.

The code says:

"# label x-axis with movie names at bar centers

plt.xticks( [ i + 0.5 for i, _ in enumerate(movies) ], movies)

plt.show()"

The 0.5 in plt.xticks should be replaced with the value 0.1 so that the movie names are at the bar centers. Thus the code should be:

plt.xticks( [ i + 0.1 for i, _ in enumerate(movies) ], movies)

Gavan Corke  Feb 23, 2017 
Printed Page 51
6th block of example code

def vector_mean(vectors):
    """compute the vector whose ith element is the mean of the
    ith elements of the input vectors"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))


When you run the vector_mean function the result is always a vector full of zeros unless the list of vectors passed into the function only contains one vector. The scalar_multiply function has 1/n passed into it, but this is rounded to 0 when dividing 1 by any integer greater than 1.

This is corrected by changing the 4th line of code to:

n = float(len(vectors))
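
A self-contained sketch of the fix follows. The helper bodies (vector_add, vector_sum, scalar_multiply) are not quoted in this report, so the versions here are assumptions. Writing 1.0 / n gives a float result in both Python 2 and Python 3.

```python
def vector_add(v, w):
    """Assumed helper: add corresponding elements."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    """Assumed helper: sum all corresponding elements."""
    result = vectors[0]
    for vector in vectors[1:]:
        result = vector_add(result, vector)
    return result

def scalar_multiply(c, v):
    """Assumed helper: multiply every element of v by c."""
    return [c * v_i for v_i in v]

def vector_mean(vectors):
    """Compute the vector whose ith element is the mean of the
    ith elements of the input vectors."""
    n = len(vectors)
    return scalar_multiply(1.0 / n, vector_sum(vectors))

print(vector_mean([[1, 2], [3, 4]]))  # [2.0, 3.0]
```

In Python 2, placing from __future__ import division at the top of the module is another way to make 1/n a true division. In Python 3, / is already true division.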

Jeff Wallace  Nov 15, 2017 
PDF Page 84
2nd paragraph (below first code block)

Rejection range is incorrect.

"... rejects H0 when X is between 526 and 531 ..."

It should be

"... rejects H0 when X is larger than 526 ..."

Anonymous  Mar 14, 2017 
PDF Page 99
last paragraph

I see the following sentence there:

And changing one of our data points by a small amount e might increase the median by e, by some number less than e, or not at all (depending on the rest of the data).

I'm confused. Can changing a value change the median by e?
I think the median does not change unless the change alters the order of the sorted data around the median.
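
The three cases in the quoted sentence can be checked with a small sketch using the standard library's statistics.median. The data and the value e = 0.5 are hypothetical, chosen so the arithmetic is exact.

```python
from statistics import median

e = 0.5  # the small change applied to one data point

# Odd-length data: the median is the middle value itself.
odd = [1, 2, 3, 4, 5]
print(median(odd))                   # 3
print(median([1, 2, 3 + e, 4, 5]))   # 3.5: changing the middle point moves the median by e
print(median([1, 2, 3, 4, 5 + e]))   # 3: changing an extreme point leaves it unchanged

# Even-length data: the median is the average of the two middle values.
even = [1, 2, 3, 4]
print(median(even))                  # 2.5
print(median([1, 2 + e, 3, 4]))      # 2.75: the median moves by e / 2
```

In these examples no value crosses another in sorted order, yet the median still moves whenever the changed point is one of the middle values.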

Sina Saeednia  Apr 22, 2021 
Printed Page 100
last code block before the "return min_theta" statement

In the line of code:

# and take a gradient step for each of the data points
for x_i, y_i in in_random_order(data):
    gradient_i = gradient_fn(x_i, y_i, theta)
    theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

I think you meant to take the gradient on only a subset of "data". Otherwise, by looping over the entire dataset you are taking a gradient step which includes all of the data.

Eder Izaguirre  Mar 03, 2017 
Other Digital Version 147
2nd to last

"model on page 142" --> model is actually on pg 143

Patrick  Jul 26, 2017 
PDF Page 219
backpropagate function definition

Dear Joel,

sorry to bother you but I have a question regarding the computation of the output_deltas.

It appears in the code that if the output = 1, the term
[ output * (1 - output) * (output - target) ] = 0 regardless of the target value.

I do not understand this part, because the output could be 1 without being the correct value, which is expected to equal the target value.

Is it something wrong in my brain or in the code ? :)

Thanks

Best regards

Jerome

Jerome_Massot  Jul 05, 2017 
PDF Page 219
'backpropagate' function

Two things. First the expression for output_deltas is wrong. There is actually no factor of output*(1 - output), this only comes in when considering hidden layers due to the chain rule. Although we do differentiate the sigmoid once, the result simplifies from the two terms in the definition of the logistic cost function.

Secondly, it is wrong to update the weights going into the final layer before the errors for preceding layers have been calculated. As the calculation for the errors depends on the weight, we end up with wrong values for the hidden errors, and hence do not update the weights going into the hidden layer correctly.

Correction of both of these yields significant improvement in performance.

Sam Vs  Feb 10, 2018 
Printed Page 284
4th paragraph

The SQL query has two errors:

1. user.id in SELECT should be users.user_id.
2. The following GROUP BY statement should be added at the end of the query:
GROUP BY users.user_id

The complete query should be as follows:

SELECT users.user_id, COUNT(user_interests.interest) AS num_interests
FROM users
LEFT JOIN user_interests
ON users.user_id = user_interests.user_id
GROUP BY users.user_id
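
The corrected query can be tried against an in-memory SQLite database from Python. This is a sketch under assumptions: sqlite3 stands in for whatever database the book's example targets, the table contents are hypothetical, and an ORDER BY clause is added only to make the printed order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows for illustration.
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE user_interests (user_id INTEGER, interest TEXT);
    INSERT INTO users VALUES (0, 'A'), (1, 'B'), (2, 'C');
    INSERT INTO user_interests VALUES (0, 'SQL'), (0, 'NoSQL'), (2, 'SQL');
""")

# The query from the report, plus ORDER BY for a stable result order.
query = """
    SELECT users.user_id, COUNT(user_interests.interest) AS num_interests
    FROM users
    LEFT JOIN user_interests
    ON users.user_id = user_interests.user_id
    GROUP BY users.user_id
    ORDER BY users.user_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(0, 2), (1, 0), (2, 1)]
conn.close()
```

User 1 has no rows in user_interests, so the LEFT JOIN keeps that user with a NULL interest, and COUNT over the interest column reports 0.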

Sergiy Kolesnikov  Jan 10, 2018