Python for Data Analysis

Errata for Python for Data Analysis



The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.




Version Location Description Submitted By Date Submitted Date Corrected
Safari Books Online
Chapter 2
Subsection: Duck Typing

"this means it has a __iter__ “magic method,” though an alternative" The comma after "method" should come after the closing quotation mark, not before it.

Note from the Author or Editor:
Changing this in the source material

Naman Bhalla  Nov 11, 2017  Sep 21, 2018
Safari Books Online
Ch11
subsection "Converting between string and datetime"

In the part discussing converting strings to datetime objects, you say that strptime uses the same format codes as strftime, but that's not quite right:

    value = '2011-01-03'
    stamp = datetime.strptime(value, '%Y-%m-%d')  # works
    datetime.strptime(value, '%F')  # ValueError: 'F' is a bad directive in format '%F'
    datetime.strftime(stamp, '%F')  # works

Note from the Author or Editor:
Quite right. Fixing the language to say "many of the same"

Alex Branham  Dec 04, 2017  Sep 21, 2018
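
The discrepancy reported above can be probed with a short sketch that uses only the standard library. `try_parse` is a hypothetical helper introduced here for illustration. Whether a shorthand directive such as %F is accepted may depend on the Python version and platform, so the sketch checks for support instead of assuming an outcome.

```python
from datetime import datetime

value = "2011-01-03"

# Explicit directives are understood by both strptime and strftime.
stamp = datetime.strptime(value, "%Y-%m-%d")
round_trip = stamp.strftime("%Y-%m-%d")


def try_parse(text, fmt):
    """Return the parsed datetime, or None if strptime rejects the format."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


# Probe for shorthand support rather than relying on it.
shorthand = try_parse(value, "%F")

print(stamp, round_trip, shorthand)
```

Spelling out the format as '%Y-%m-%d' is the portable choice when the same format string has to work for both parsing and formatting.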
Safari Books Online
Ch6
Slicing Section, 3rd paragraph

"While element at the start index is included, the stop..." Should probably be: "While *the* element at the start index is included, the stop..."

Yung-Jin (Joey) Hu  Feb 26, 2017  Sep 25, 2017
Safari Books Online
Ch6
Note within "Indentation, not braces" section

"I strongly recommend that you use 4 spaces *to* as your default indentation..." Should probably be: "I strongly recommend that you use 4 spaces as your default indentation..." by removing the word "to" before the second half of the sentence "... as your default indentation".

Yung-Jin (Joey) Hu  Feb 26, 2017  Sep 25, 2017
Safari Books Online
Ch5
Summarizing and Computing Descriptive Statistics; code block 3

The input variable df is:

    In [187]: df
    Out[187]:
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3

The code in the book gives this result:

    In [204]: df.sum(axis=1)
    Out[204]:
    a    1.40
    b    2.60
    c    0.00
    d   -0.55
    dtype: float64

But shouldn't row "c" be NaN, since we're summing two NaNs? Here's what I get from my interpreter:

    In [186]: df.sum(axis=1)
    Out[186]:
    a    1.40
    b    2.60
    c     NaN
    d   -0.55
    dtype: float64

    In [188]: pd.__version__
    Out[188]: '0.19.2'

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
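
Both results quoted in this report can be requested explicitly. This is a minimal sketch, assuming a pandas release that supports the `min_count` argument of `sum` (newer than the 0.19.2 used by the submitter). The DataFrame is rebuilt from the values shown above; the names `strict` and `lenient` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.40, 7.10, np.nan, 0.75], "two": [np.nan, -4.5, np.nan, -1.3]},
    index=["a", "b", "c", "d"],
)

# Require at least one non-missing value per row; an all-NaN row stays NaN.
strict = df.sum(axis=1, min_count=1)

# Require no valid values; an all-NaN row sums to 0.0.
lenient = df.sum(axis=1, min_count=0)

print(strict)
print(lenient)
```

Passing `min_count` makes the intended treatment of all-missing rows explicit instead of leaving it to the installed version's default.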
Safari Books Online
Ch5
Sorting and ranking; within the code examples

It looks like `sort_index(by=...)` is deprecated.

    In [203]: frame.sort_index(by='b')
    FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)

    In [207]: frame.sort_index(by=['a','b'])
    FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)

Using sort_values instead fixed the problem:

    In [205]: frame.sort_values(by='b')

    In [211]: pd.__version__
    Out[211]: '0.19.2'

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
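
The replacement recommended by the warning can be sketched as follows. The contents of `frame` are an assumption made for illustration, since the report does not show them.

```python
import pandas as pd

# Hypothetical data with columns 'a' and 'b', matching the column names in the report.
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort rows by the values in one column.
by_b = frame.sort_values(by="b")

# Sort by 'a' first, breaking ties with 'b'.
by_a_then_b = frame.sort_values(by=["a", "b"])

print(by_b)
print(by_a_then_b)
```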
Safari Books Online
Ch5
Handling Missing Data; 2nd paragraph

"The way that missing data is represented in pandas object is somewhat imperfect, but it is functional for a lot of *usres*." *users* is spelled incorrectly.

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
Other Digital Version
Ch5
Integer Indexes; 4th paragraph

"an axis index containing *itnegerse*, data selection" "integers" is spelled incorrectly.

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
Safari Books Online
Ch5
Indexing, selection, and filtering; Table 5-6. Indexing options with DataFrame

`df.iloc[where]` Selects single row or subset of rows from the DataFrame by label. Should probably be "...from the DataFrame by *integer position*."

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
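
The distinction behind this correction is easiest to see when the index itself holds integers. The DataFrame below is a hypothetical example, not taken from the book.

```python
import pandas as pd

# Integer labels deliberately out of positional order.
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

by_position = df.iloc[0]  # first row, whatever its label is
by_label = df.loc[0]      # the row whose index label is 0

print(by_position["x"], by_label["x"])
```

Here `iloc[0]` returns the row labeled 2, while `loc[0]` returns the second row, which is why the table entry for `iloc` should refer to integer position.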
Safari Books Online
?
Table 3-4

There are two rows in the table that describe the readlines function.

Daniel Walter  Aug 02, 2017  Sep 25, 2017
Safari Books Online
Ch. 4
Table 4-2. NumPy data types

The fourth row of Table 4-2 reads "Signed and unsigned 32-bit integer types", but the 32-bit types are already covered by the third row. The fourth row should say 64-bit.

Kim, Jin  Sep 10, 2017  Sep 25, 2017
Safari Books Online
3
Boolean Indexing 5th paragraph

The line of code:

    data[-(names == 'Bob')]

gives this deprecation warning under numpy version 1.12.0:

    DeprecationWarning: numpy boolean negative, the `-` operator, is deprecated, use the `~` operator or the logical_not function instead.

Using the tilde operator, as recommended, silences the warning.

Yung-Jin (Joey) Hu  Jan 31, 2017  Sep 25, 2017
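
The two alternatives named in the warning can be sketched like this. The arrays are stand-ins chosen for illustration; only the variable names `names` and `data` come from the report.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.arange(28).reshape(7, 4)  # one row per entry in names

mask = names == "Bob"

# Invert the boolean mask with ~ ...
not_bob = data[~mask]

# ... or with the equivalent function call.
not_bob_alt = data[np.logical_not(mask)]

print(not_bob)
```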
Safari Books Online
5
statsmodel section

I am not sure of the page number since I am using Safari books online which doesn't do pagination. In the statsmodel section of chapter 1, line 6, the word "grown" is misspelled (gornw).

Bala Ganeshan  Dec 10, 2016  Sep 25, 2017
PDF
Page 7
2nd paragraph from the bottom

The text says: "...the new statsmodels project in 2010 and since then have gornw the project to a critical mass..." the word GROWN is misspelled.

Alain Ledon  Nov 15, 2016  Sep 25, 2017
PDF
Page 9
Last line of Windows discussion

Text states: "To exit the shell, press Ctrl-D or type the command exit() and press return." On Windows, Ctrl-Z should be used.

David Welden  Sep 16, 2017  Sep 25, 2017
Safari Books Online
18
Top of second page of Chapter 2

The example uses 1.usa.gov data. This service has been shut down. It would be a pain to craft a whole new opening example, but you might want to. Even if you don't, you might want to let people know it's no longer online so they don't look for it. https://blog.usa.gov/decommissioning-1-usa-gov https://github.com/usagov/1.USA.gov-Data

Note from the Author or Editor:
You are right. I added a note that it is decommissioned.

John Transue  Dec 06, 2016  Sep 25, 2017
Printed
Page 29
top text and commands

Magic functions can be used by default without the percent sign ... Some magic functions behave like Python functions and their output can be assigned to a variable:

    In [22]: %pwd
    Out [22]: '/home/west/code/pydata-book/

    In [23]: foo = %pwd

First, a single quote is missing from Out[22]. Second, with ipython 6.3.1, although In [22] works using pwd without the leading percent sign, In [23] fails with "NameError: name 'pwd' is not defined".

Note from the Author or Editor:
Fixing this typo

Gregory Sherman  Apr 13, 2018  Sep 21, 2018
PDF
Page 38
First paragraph of "Mutable and immutable objects"

Text says "modifiedK", should be "modified:"

David Welden  Sep 17, 2017  Sep 25, 2017
Printed
Page 46
Code blocks in 2nd and 3rd paragraphs

Two illegal print statements: print('It's negative') Since the strings contain single quote characters, they should be delimited by double quotes.

Note from the Author or Editor:
Thank you, fixed

Michael Clark  Nov 05, 2017  Sep 21, 2018
PDF, ePub
Page 51
2nd sentence below 'pass' heading

" ... to be taken (or as a placeholder for code not yet implemetned); ..." "implemented" is spelled incorrectly.

Greg Graham  Jun 06, 2017  Sep 25, 2017
Printed
Page 56
First sentence

Sentence should read: "...which locates the first such value and removes it from the list..."

Note from the Author or Editor:
Thanks! Fixing the typo

Thomas Koundakjian  Nov 09, 2017  Sep 21, 2018
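
The behavior described by the corrected sentence can be checked with a list that contains a repeated value. The list contents are a hypothetical example.

```python
b_list = ["foo", "bar", "baz", "foo"]

# remove deletes only the first matching value; later duplicates stay in place.
b_list.remove("foo")

print(b_list)
```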
Printed
Page 61
Top

In the example demonstrating zip, you use the names of three pitchers: Nolan Ryan, Roger Clemens, and Curt Schilling. In the example, you use zip to show first names and last names; the first_names has ('Nolan', 'Roger', 'Schilling') and last_names has ('Ryan', 'Clemens', 'Curt') Curt is his first name and Schilling is his last name, so Curt should be in first_names and Schilling in last_names.

Note from the Author or Editor:
Fixing this mistake

Jon Ernster  Nov 28, 2017  Sep 21, 2018
Printed
Page 66
Table 3-1. Python set operations

The alternative syntax for a.issubset(b) and a.issuperset(b) should be "<=" and ">=" respectively (not N/A).

Note from the Author or Editor:
Fixing this.

Daniel Andersson  Feb 03, 2018  Sep 21, 2018
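
The operator forms mentioned in this report can be compared with the method forms directly. The sets are arbitrary examples.

```python
a = {1, 2, 3}
b = {1, 2, 3, 4, 5}

# Subset test: method form and operator form.
subset_method = a.issubset(b)
subset_operator = a <= b

# Superset test: method form and operator form.
superset_method = b.issuperset(a)
superset_operator = b >= a

print(subset_method, subset_operator, superset_method, superset_operator)
```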
Printed
Page 70
bottom of page

Suppose instead we had declared a as follows:

    a = []
    def func():
        for i in range(5):
            a.append(i)

The sentence implies that an explanation of what happens will follow, but there is none.

Note from the Author or Editor:
Good catch. I'm adding a code example to show how the alternate example works

Gregory Sherman  Apr 14, 2018  Sep 21, 2018
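
What the quoted snippet does when called can be sketched as follows. This is an illustration only and is not the example the author added to the book.

```python
a = []


def func():
    # append mutates the list object that the module-level name `a` refers to.
    # Nothing is assigned to `a` inside the function, so no `global`
    # declaration is needed.
    for i in range(5):
        a.append(i)


func()
print(a)
```

After the call, the module-level list holds the five appended values, because the function modified the existing object instead of creating a local variable.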
Printed
Page 82
bottom of page; first entry of Table 3-4

The table entry reads:

    Method: read([size])
    Description: Return data from a string, with optional size argument indicating the number of bytes to read

The "number of bytes" assertion is contradicted on the next page ("Python reads enough bytes ... to decode that many characters") and in the read() docstring ("Read at most n characters from stream.").

Note from the Author or Editor:
Adding language to indicate that whether bytes or unicode are read depends on the mode of the file

Gregory Sherman  Apr 15, 2018  Sep 21, 2018
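
The difference between the two modes can be sketched with in-memory streams from the standard library. The sample text is an assumption chosen so that the character count and the byte count differ.

```python
import io

# 10 characters, but 12 bytes in UTF-8: the two accented letters take 2 bytes each.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Binary stream: read(5) returns up to 5 bytes.
binary_file = io.BytesIO(raw)
first_bytes = binary_file.read(5)

# Text stream: read(5) returns up to 5 decoded characters,
# consuming as many bytes as the decoding requires.
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
first_chars = text_file.read(5)

print(len(first_bytes), first_bytes)
print(len(first_chars), ascii(first_chars))
```

The five bytes cover only four characters, while the five characters span six bytes, which is the distinction the author's note refers to.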
Printed
Page 100
Warning box mid page

The warning claims boolean selection will not fail if the boolean array is not the correct length. I think this was changed in NumPy 1.13, but it is definitely not true in NumPy 1.14.2. For example:

    x = np.random.randn(5,5)
    y = np.array(['a','b','c', 'a', 'b', 'c', 'd', 'd', 'd'])
    x[y == 'a']
    IndexError: boolean index did not match indexed array along dimension 0; dimension is 5 but corresponding boolean dimension is 9

Note from the Author or Editor:
Removing the caution box

Mladen Kolovic  Apr 08, 2018  Sep 21, 2018
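
The reported behavior can be reproduced with a sketch like the one below, assuming a NumPy version at least as recent as the one in the report. The variable `outcome` is introduced here for illustration.

```python
import numpy as np

x = np.random.randn(5, 5)
y = np.array(["a", "b", "c", "a", "b", "c", "d", "d", "d"])

# A mask of length 9 cannot index an axis of length 5.
try:
    x[y == "a"]
    outcome = "no error"
except IndexError:
    outcome = "IndexError"

# A mask whose length matches the axis works as expected.
matching = x[y[:5] == "a"]

print(outcome, matching.shape)
```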
Printed
Page 103
3rd paragraph

1st release print copy says, “...the result of fancy indexing is always one-dimensional.” However, there are example outputs in this section with more than one dimension. Is that because some of the examples in the section are not fancy indexing? If that’s the case, it’s unclear where the section is building up to a fancy indexing example as opposed to every example being fancy indexing. The number of dimensions in the output seems to be the number of array dimensions plus one minus the number of dimensions indexed.

Note from the Author or Editor:
The text is unclear, I will clarify

Stephen Frost  Feb 19, 2018  Sep 21, 2018
PDF
Page 108
Table 4-3. Unary ufuncs

Missing term: "Natural logarithm (base e), log base 10, log base 2, and , respectively".

A. Jesse Jiryu Davis  May 23, 2017  Sep 25, 2017
PDF
Page 123
1st Paragraph

"operations" is misspelled at the location, "Using NumPy functions or NumPy-like oeprations..."

Ryan Shuhart  Jan 05, 2017  Sep 25, 2017
PDF
Page 141
Sentence beginning with word Setting in italics

Word "section" is misspelled as "sectino"

Anonymous  Sep 20, 2017  Sep 25, 2017
PDF
Page 145
last paragraph

"To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented." I think this should be integer-oriented.

Note from the Author or Editor:
The code example does not illustrate the intended behavior. I am changing the example to be "ser[-1]" instead of "ser[:2]" and added a note that slicing with integers ignores the integer labels

Yang Yang  Oct 18, 2017  Sep 21, 2018
PDF
Page 160
Table 5-8

The text says: "argmin, argmax - Compute index locations (integers) at which minimum or maximum value obtained, respectively"
Should be: "argmin, argmax - Compute index labels for Series at which minimum or maximum value obtained, respectively"

Example from this chapter, which returns a label, not an integer:

    In [115]: df.loc['d'].argmin()
    Out[115]: 'two'

Note from the Author or Editor:
Per https://github.com/pandas-dev/pandas/issues/16830 this is supposed to return the positional values but did not for a while because of some changes in pandas. In the future, it will do the right thing (what the book says now), so I'm not going to change the book

Andrey Dubinchak  Dec 14, 2017  Sep 21, 2018
PDF
Page 164
table 5-9

Looks like instead of method "match" there should be "get_indexer"

Note from the Author or Editor:
Fixing this to "get_indexer"

Aivar Annamaa  Nov 18, 2017  Sep 21, 2018
PDF
Page 173
in Table 6-2

The 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012.

Note from the Author or Editor:
I am changing the text to correspond to changes in the latest version of pandas

Noritada Kobayashi  Nov 05, 2017  Sep 21, 2018
PDF
Page 174
after Out[38]

As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012.

Note from the Author or Editor:
I am changing the text to correspond with changes in pandas

Noritada Kobayashi  Nov 05, 2017  Sep 21, 2018
PDF
Page 175
top

As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012.

Note from the Author or Editor:
I am changing the text to correspond to the current version of pandas

Noritada Kobayashi  Nov 05, 2017  Sep 21, 2018
Printed
Page 180
1st paragraph

Refers to the USDA Food Database example in Chapter 7; in second edition, this example is in Chapter 14.4 (page 436-442)

Note from the Author or Editor:
Fixing this reference to point to Ch 14

Laura Hughes  Jan 31, 2018  Sep 21, 2018
PDF
Page 182
first block of code for getroot

Code says the example file is in path: path='examples/mta_perf/Performance_MNR.xml' Actual path from git repository is: path='datasets/mta_perf/Performance_MNR.xml'

Note from the Author or Editor:
Correct, thank you. This will need to be fixed in the source files

David Welden  Sep 25, 2017  Oct 20, 2017
PDF, ePub
Page 184
Link to Apache Arrow in 'Feather' Section

URL for 'Apache Arrow' points to 'apache.arrow.org' instead of 'arrow.apache.org'

Joel A  Oct 08, 2017  Oct 20, 2017
PDF
Page 185
First sentence of final paragraph

Text reads "...how they can sunit your needs" Should be "...how they can suit your needs"

Note from the Author or Editor:
This typo is fixed in the final 2nd edition

David Welden  Sep 25, 2017  Sep 25, 2017
Printed
Page 186-187
Under heading on 186, second code block on 187

This is not so much an error, per se, as a comment on a "may" clause in the book. I'm writing this in case you like to track these sorts of issues. On page 186, the text says "Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these manually with pip or conda." This is very true, as the example line on the next page (187), "writer = pd.ExcelWriter('examples/ex2.xlsx')", threw an error on my system. I'm using pandas 0.21.0 within a Python 3.6.2 virtual environment. Manually installing the packages in question via pip solved my problems. Thanks!

Note from the Author or Editor:
I'm changing the language to say "These must be installed separately"

Jim Sam  Dec 08, 2017  Sep 21, 2018
PDF
Page 192
Table 7-1

The left column is named as "Argument", which should be "Method".

Note from the Author or Editor:
Making suggested change

Noritada Kobayashi  Nov 25, 2017  Sep 21, 2018
PDF
Page 208
1st paragraph of a section named "Computing Indicator/Dummy Variables"

The paragraph says "Let’s return to an earlier example DataFrame". However, since that example is contained in section 8.2 in the 2nd edition, "earlier" is not an appropriate word.

Note from the Author or Editor:
Fixing language to "Let's consider an example DataFrame..."

Noritada Kobayashi  Nov 27, 2017  Sep 21, 2018
PDF
Page 213
Table 7-3

The left column is named as "Argument", which should be "Method".

Note from the Author or Editor:
Making suggested change

Noritada Kobayashi  Nov 25, 2017  Sep 21, 2018
PDF
Page 217
bottom

It's really not clear what In [176]: matches.str.get(1) is supposed to be returning here. Similarly with In [177]: matches.str[0] and matches.str[0]. I would expect to be shown a method to retrieve the regex matched groups for each email address string, but this clearly isn't what happens with this syntax. Was something else meant?

Note from the Author or Editor:
I am fixing this example. The erratum was reported by many others

Anonymous  Mar 09, 2018  Sep 21, 2018
Printed
Page 219
7.4 Conclusion

"Effective data preparation can significantly improve productive by ..." should read "Effective data preparation can significantly improve productivity by ..."

Note from the Author or Editor:
Fixing the typo

Francis Lewis  Jan 10, 2018  Sep 21, 2018
PDF
Page 219
Table 7-5

The book says: "match - Use re.match with the passed regular expression on each element, returning matched groups as list"
Should say: "... returning Series/array of boolean values"

The commands on pp. 217-218 are also not correct, because they return boolean values, so there are no elements to access at all. Instead of:

    In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE)

    In [175]: matches
    Out[175]:
    Dave     True
    Rob      True
    Steve    True
    Wes       NaN
    dtype: object

    In [176]: matches.str.get(1)
    Out[176]:
    Dave    NaN
    Rob     NaN
    Steve   NaN
    Wes     NaN
    dtype: float64

    In [177]: matches.str[0]
    Out[177]:
    Dave    NaN
    Rob     NaN
    Steve   NaN
    Wes     NaN
    dtype: float64

it may be better to use:

    In [174]: matches = data.str.extract(pattern, flags=re.IGNORECASE)

    In [175]: matches
    Out[175]:
               0       1    2
    Dave    dave  google  com
    Rob      rob   gmail  com
    Steve  steve   gmail  com
    Wes      NaN     NaN  NaN

    In [176]: matches[0]
    Out[176]:
    Dave      dave
    Rob        rob
    Steve    steve
    Wes        NaN
    Name: 0, dtype: object

    In [177]: matches.iloc[:, 0]
    Out[177]:
    Dave      dave
    Rob        rob
    Steve    steve
    Wes        NaN
    Name: 0, dtype: object

Note from the Author or Editor:
This behavior changed in pandas. I'm correcting the code examples and the language in the text

Andrey Dubinchak  Dec 27, 2017  Sep 21, 2018
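
The contrast between the two methods in this report can be run as a self-contained sketch. The email addresses and the regular expression are assumptions chosen to be consistent with the output quoted above; the report itself does not show how `data` and `pattern` are defined.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical inputs consistent with the quoted output.
data = pd.Series(
    {
        "Dave": "dave@google.com",
        "Rob": "rob@gmail.com",
        "Steve": "steve@gmail.com",
        "Wes": np.nan,
    }
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# match reports whether each element matches the pattern.
is_match = data.str.match(pattern, flags=re.IGNORECASE)

# extract returns the captured groups, one column per group.
groups = data.str.extract(pattern, flags=re.IGNORECASE, expand=True)

print(is_match)
print(groups)
```

`extract` is the call to reach for when the goal is to pull out the user name, domain, and suffix, since `match` only answers whether the pattern matched.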
Printed
Page 224
The line before the section "Reordering and Sorting Levels"

The code MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color']) should be pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color'])

Note from the Author or Editor:
Adding the "pd."

Klaus Wang  May 17, 2018  Sep 21, 2018
PDF
Page 229
Table 8-1 Different join types with how argument

Final entry in table is 'output' join. It should be 'outer' join.

Note from the Author or Editor:
This needs to be changed from "output" to "outer"

David Welden  Sep 26, 2017  Oct 20, 2017
PDF
Page 241
Final paragraph

The example of Series method combine_first is a bit vague. Although it apparently produces the desired output, the choice of b[:-2] and a[2:] for arguments is not obvious. It appears that it was chosen in order to reorder the index as well as combining data values, but this is not explained.

Note from the Author or Editor:
I am changing the code example to omit the slicing, and instead make "a" and "b" have their index labels in different order. This will definitely be clearer to the reader. Thanks for pointing this out

David Welden  Sep 27, 2017  Oct 20, 2017
PDF
Page 274
after Figure 9-17

Although the text refers to the "tipping dataset used earlier in the book", the tipping dataset does not seem to be used earlier. It is first used in this section and later in Ch. 10.

Note from the Author or Editor:
This is also used in chapter 9, but the language there was also incorrect. I am tweaking the language in both chapters 9 and 10 to reflect that these are the first times that readers will have seen this dataset

Noritada Kobayashi  Nov 11, 2017  Sep 21, 2018
Printed
Page 335
ts.shift(1, freq='90T') example

This method with 90T parameter should lag the data by 90 minutes at 90 min frequency. Instead, it seems to preserve the monthly frequency and only lag every timestamp by 1:30hr. Am I reading this correctly or is this by design? Clarification would be helpful.

Note from the Author or Editor:
I will add a note to the text to clarify that the "freq" parameter does not change the frequency of the data (if any)

Serge  Jan 25, 2018  Sep 21, 2018
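
The point of the author's note can be sketched with a small daily series. The series is a hypothetical example, and the offset is spelled '90min' here on the assumption that it is accepted by both older and newer pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series; the book's example uses different data.
ts = pd.Series(
    np.arange(4.0),
    index=pd.date_range("2000-01-01", periods=4, freq="D"),
)

# Without freq: values move down one slot, the index stays put.
naive = ts.shift(1)

# With freq: every timestamp advances by 90 minutes, values stay put.
shifted = ts.shift(1, freq="90min")

print(naive)
print(shifted)
```

The shifted series still has one observation per day; only the timestamps have moved, which matches the clarification that `freq` does not resample the data.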
Printed
Page 340
The source code that shows Timestamp arithmetic before the DST transition

In the source code that shows arithmetic before the DST transition, the book uses '2012-3-12 01:30', tz='US/Eastern'. But in 2012, DST in US/Eastern started on 2012-3-11, so the code here does not show arithmetic across the DST transition, and it may not make sense to readers. The first edition of this book used '2012-03-11', not '2012-03-12', and was correct.

Note from the Author or Editor:
Confirmed. Fixing

Masato Setoyama  Mar 02, 2018  Sep 21, 2018
Printed
Page 351
Table 11-5, last row

convention defaults to 'start', not 'end'.

Note from the Author or Editor:
Fixing.

Hengni Cai  Mar 29, 2018  Sep 21, 2018
PDF
Page 437
Between 1st paragraph and 2nd paragraph

After the last sentence "Then, these can be concatenated together with concat:", it looks like some Python code is needed for the text to make sense. The code can be found in https://github.com/wesm/pydata-book/blob/2nd-edition/ch14.ipynb and reads:

    nutrients = []

    for rec in db:
        fnuts = pd.DataFrame(rec['nutrients'])
        fnuts['id'] = rec['id']
        nutrients.append(fnuts)

    nutrients = pd.concat(nutrients, ignore_index=True)

Note from the Author or Editor:
Thanks -- I am restoring the code to the text (it was being accidentally suppressed in the output)

Haruyoshi TAKIGUCHI  Apr 03, 2018  Sep 21, 2018
PDF
Page 452
1st paragraph

The paragraph states that "the result is shown in Figure A-3", but Figure A-3 is "illustration", not "result" (just a cosmetic issue).

Note from the Author or Editor:
Changing language to "this is illustrated in Figure A-3"

Noritada Kobayashi  Nov 26, 2017  Sep 21, 2018
PDF
Page 467
center of the page

The paragraph states that "the output of outer will have a dimension that is the sum of the dimensions of the inputs". Since the result of outer for (3, 4) and (5,) is (3, 4, 5), is it better to replace the word "sum" with "concatenation"?

Note from the Author or Editor:
Making suggested change

Noritada Kobayashi  Nov 26, 2017  Sep 21, 2018
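
The shape rule discussed in this report can be checked directly. The sketch uses `np.subtract.outer` as one representative ufunc and random inputs with the shapes mentioned above.

```python
import numpy as np

x = np.random.randn(3, 4)
y = np.random.randn(5)

# outer applies the ufunc to every pair of elements from x and y.
result = np.subtract.outer(x, y)

# The output shape is the input shapes joined end to end, not added.
print(result.shape)
```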
PDF
Page 473
Code example 188

It would be better to make the zipped result prettier for the last code example:

    In [188]: zip(last_name[sorter], first_name[sorter])
    Out[188]: <zip at 0x7fa203eda1c8>

Note from the Author or Editor:
Adding "list(...)" to make the example prettier

Noritada Kobayashi  Nov 27, 2017  Sep 21, 2018
PDF
Page 485
The first paragraph and code

Original:

    Since the input variables are strings they can be executed again with the Python exec keyword:

    In [30]: exec(_i27)

I propose the following:

    Since the input variables are strings they can be evaluated again with the Python eval keyword:

    In [30]: eval(_i27)
    Out[30]: 'bar'

It looks like "exec" does not make sense in this context because _i27 is not a statement or code.

Note from the Author or Editor:
It's not a mistake but "eval" makes the example more illustrative. Changing

Haruyoshi TAKIGUCHI  Apr 28, 2018  Sep 21, 2018
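
The difference between the two built-ins can be sketched outside IPython. The name `input_text` is a hypothetical stand-in for IPython's `_i27`, assumed here to hold the expression text `foo`.

```python
foo = "bar"
input_text = "foo"  # stand-in for the stored input string

# exec runs the code but always returns None, so nothing is displayed.
exec_result = exec(input_text)

# eval evaluates the expression and returns its value.
eval_result = eval(input_text)

print(exec_result, eval_result)
```

Because `eval` hands back the value of the expression, it produces visible output in the example, which is why the author describes it as more illustrative.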
PDF
Page 494
The first code quote in the section "Basic Profiling: %prun and %run -p"

Found two errors when running under Python 3.

1) The line

    for _ in xrange(niter):

needs to be replaced by something like

    for _ in range(niter):

2) The line

    print 'Largest one we saw: %s' % np.max(some_results)

needs to be replaced by something like

    print('Largest one we saw: {0}'.format(np.max(some_results)))

Note from the Author or Editor:
Fixing this

Haruyoshi TAKIGUCHI  Apr 08, 2018  Sep 21, 2018
Mobi
Page 2621

"For large DataFrames, the head method is useful to get see the first 5 rows:" 'get' should be removed

Bridgeland  Mar 29, 2017  Sep 25, 2017