Errata
This errata list records errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction appears in the "Date corrected" column.
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Version | Location | Description | Submitted By | Date submitted | Date corrected |
---|---|---|---|---|---|
Safari Books Online | Chapter 2 Subsection :- Duck Typing |
"this means it has a __iter__ “magic method,” though an alternative" The comma after method should actually be after ". Note from the Author or Editor: |
Naman Bhalla | Nov 11, 2017 | Sep 21, 2018 |
Safari Books Online | Ch11 subsection "Converting between string and datetime" |
In the part discussing converting datetime objects from strings, you say that strptime uses the same format codes as strftime, but that's not quite right: value = '2011-01-03' stamp = datetime.strptime(value, '%Y-%m-%d') # works datetime.strptime(value, '%F') # ValueError: 'F' is a bad directive in format '%F' datetime.strftime(stamp, '%F') # works Note from the Author or Editor: |
Alex Branham | Dec 04, 2017 | Sep 21, 2018 |
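The format-code difference reported in the entry above can be checked with a minimal sketch. The helper name `parses_with` is hypothetical, and whether `strptime` accepts `%F` depends on the Python version and platform, so the sketch probes for it rather than assuming a result.

```python
from datetime import datetime

value = "2011-01-03"

# Explicit directives such as %Y, %m and %d are documented for both
# strptime (parsing) and strftime (formatting).
stamp = datetime.strptime(value, "%Y-%m-%d")
round_trip = stamp.strftime("%Y-%m-%d")


def parses_with(fmt):
    """Hypothetical helper: True if strptime accepts `fmt` for `value` here."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# "%F" is a C-library shorthand; check it on your own build instead of
# assuming it behaves the same in both directions.
supports_percent_f = parses_with("%F")
```

Spelling out `%Y-%m-%d` avoids the question entirely, which is probably the safer choice in portable code.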
Safari Books Online | ?? Integer Indexes Section Paragraph 4 |
In the Integer Indexes section of Chapter 5 the following paragraph is ambiguous: "To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):" To test this out I defined the following object: `ser3 = Series(np.arange(4.), index=['a', 'b', -1, 34])` and ran these two commands, both of which return 2.0: `ser3[-1]` `ser3[-2]` `ser3.index` gives me "Index(['a', 'b', -1, 34], dtype='object')" So, I think you could argue that the way Pandas actually works has some ambiguity to it and that the way the book describes it is the way it SHOULD work. But to describe the actual way this part of Pandas works, the following paragraph would be more accurate: "To keep things consistent, if you have an axis index containing exclusively integers (mixed indexes will match on label first, and fall back to positional indexing), data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):" Or something of that nature. Note from the Author or Editor: |
Bob McDonald | Dec 16, 2018 | |
Mobi | Current: "Afer this operation, the variable a is unmodified:" Suggested: "After this operation, the variable..." |
O'Reilly Media, Inc. |
Sep 17, 2019 | ||
Other Digital Version | location 1741 top |
Found this error in the Kindle version, location 1741. The line `In[76]: seq[3:4] = [6,3]` should be `In[76]: seq[3:4] = [6]`. Note from the Author or Editor: |
Ravi | Nov 18, 2019 | |
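For context on the slice-assignment entry above, here is a minimal sketch of how Python treats both forms. The starting list is an assumption chosen for illustration. Python accepts a replacement of any length, so which form the text intends depends on the output printed alongside it.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]  # assumed starting list

same_length = seq.copy()
same_length[3:4] = [6]      # one element replaces one element

longer = seq.copy()
longer[3:4] = [6, 3]        # two elements replace one, so the list grows
```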
Safari Books Online | Ch5 Indexing, selection, and filtering; Table 5-6. Indexing options with DataFrame |
`df.iloc[where]` Selects single row or subset of rows from the DataFrame by label. Should probably be "...from the DataFrame by *integer position*." |
Yung-Jin (Joey) Hu | Feb 14, 2017 | Sep 25, 2017 |
Other Digital Version | Ch5 Integer Indexes; 4th paragraph |
"an axis index containing *itnegerse*, data selection" "integers" is spelled incorrectly. |
Yung-Jin (Joey) Hu | Feb 14, 2017 | Sep 25, 2017 |
Safari Books Online | Ch5 Handling Missing Data; 2nd paragraph |
"The way that missing data is represented in pandas object is somewhat imperfect, but it is functional for a lot of *usres*." *users* is spelled incorrectly. |
Yung-Jin (Joey) Hu | Feb 14, 2017 | Sep 25, 2017 |
Safari Books Online | Ch5 Sorting and ranking; within the code examples |
It looks like `sort_index(by=...)` is deprecated. In [203]: frame.sort_index(by='b') FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...) In [207]: frame.sort_index(by=['a','b']) FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...) In [205]: frame.sort_values(by='b') fixed the problem. In [211]: pd.__version__ Out[211]: '0.19.2' |
Yung-Jin (Joey) Hu | Feb 14, 2017 | Sep 25, 2017 |
Safari Books Online | Ch5 Summarizing and Computing Descriptive Statistics; code block 3 |
The input variable df is: In [187]: df Out[187]: one two a 1.40 NaN b 7.10 -4.5 c NaN NaN d 0.75 -1.3 The code in the book gives this result: In [204]: df.sum(axis=1) Out[204]: a 1.40 b 2.60 c 0.00 d -0.55 dtype: float64 but shouldn't row "c" be "NaN" since we're summing together two NaNs? Here's what I get from my interpreter: In [186]: df.sum(axis=1) Out[186]: a 1.40 b 2.60 c NaN d -0.55 dtype: float64 In [188]: pd.__version__ Out[188]: '0.19.2' |
Yung-Jin (Joey) Hu | Feb 14, 2017 | Sep 25, 2017 |
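A minimal sketch of the behavior in question, assuming pandas 0.22 or later, where the sum of an all-NaN row is 0 by default and the `min_count` parameter restores a NaN result. Older releases, such as the 0.19.2 reported above, behave differently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default: NaN values are skipped, and an all-NaN row sums to 0.
row_sums = df.sum(axis=1)

# Require at least one non-NaN value per row, otherwise return NaN.
strict_sums = df.sum(axis=1, min_count=1)
```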
Safari Books Online | Ch6 Note within "Indentation, not braces" section |
"I strongly recommend that you use 4 spaces *to* as your default indentation..." Should probably be: "I strongly recommend that you use 4 spaces as your default indentation..." by removing the word "to" before the second half of the sentence "... as your default indentation". |
Yung-Jin (Joey) Hu | Feb 26, 2017 | Sep 25, 2017 |
Safari Books Online | Ch6 Slicing Section, 3rd paragraph |
"While element at the start index is included, the stop..." Should probably be: "While *the* element at the start index is included, the stop..." |
Yung-Jin (Joey) Hu | Feb 26, 2017 | Sep 25, 2017 |
Safari Books Online | ? Table 3-4 |
There are two rows in the table that describe the readlines function. |
Daniel Walter | Aug 02, 2017 | Sep 25, 2017 |
Safari Books Online | Ch. 4 Table 4-2. NumPy data types |
The fourth row of Table 4-2 reads "Signed and unsigned 32-bit integer types", but 32-bit is already covered by the third row. The fourth row must be changed to 64. |
Kim, Jin | Sep 10, 2017 | Sep 25, 2017 |
Safari Books Online | 3 Boolean Indexing 5th paragraph |
The line of code: data[-(names == 'Bob')] gives the deprecation warning: DeprecationWarning: numpy boolean negative, the `-` operator, is deprecated, use the `~` operator or the logical_not function instead. This is with numpy version 1.12.0. Using the tilde operator, as recommended, silences the warning. |
Yung-Jin (Joey) Hu | Jan 31, 2017 | Sep 25, 2017 |
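The replacement suggested by the warning can be sketched as follows. The arrays are assumptions for illustration; only the `~` operator and `np.logical_not` come from the warning text.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.arange(28).reshape(7, 4)  # assumed stand-in for the book's data

mask = names == "Bob"

# Invert the boolean mask with ~ (or np.logical_not) rather than unary minus.
not_bob = data[~mask]
also_not_bob = data[np.logical_not(mask)]
```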
Safari Books Online | 3.1 ZIP section |
Code works as it should, but the first name and last name are reversed for Curt Schilling. Need to be a Red Sox fan to pick up on this one. ('Schilling', 'Curt') should read ('Curt', 'Schilling') in the following code: In [96]: pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ....: ('Schilling', 'Curt')] Note from the Author or Editor: |
Anonymous | Oct 01, 2018 | |
Safari Books Online | 5 statsmodel section |
I am not sure of the page number since I am using Safari books online which doesn't do pagination. In the statsmodel section of chapter 1, line 6, the word "grown" is misspelled (gornw). |
Bala Ganeshan | Dec 10, 2016 | Sep 25, 2017 |
Page 7 2nd paragraph from the bottom |
The text says: "...the new statsmodels project in 2010 and since then have gornw the project to a critical mass..." the word GROWN is misspelled. |
Alain Ledon | Nov 15, 2016 | Sep 25, 2017 | |
Page 9 Last line of Windows discussion |
Text states: To exit the shell, press Ctrl-D or type the command exit() and press return. On Windows, Ctrl-Z should be used. |
David Welden | Sep 16, 2017 | Sep 25, 2017 | |
Printed | Page 17 2nd Paragraph |
The reference to the IPython should be "Appendix B", not "Appendix A". Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Safari Books Online | 18 Top of second page of Chapter 2 |
The example uses 1.usa.gov data. This service has been shut down. It would be a pain to craft a whole new opening example, but you might want to. Even if you don't, you might want to let people know it's no longer online so they don't look for it. https://blog.usa.gov/decommissioning-1-usa-gov https://github.com/usagov/1.USA.gov-Data Note from the Author or Editor: |
John Transue | Dec 06, 2016 | Sep 25, 2017 |
Printed | Page 19 command |
$jupyter notebook fails under Windows 10 Command Prompt: "Error executing Jupyter command 'notebook': [Errno 'jupyter-notebook' not found] 2" version 4.4.0: "Available subcommands: kernel kernelspec migrate run troubleshoot" Trying 'run', there was no response - the command simply hung Note from the Author or Editor: |
Gregory Sherman | Jan 14, 2019 | |
Printed | Page 29 top text and commands |
"Magic functions can be used by default without the percent sign ... Some magic functions behave like Python functions and their output can be assigned to a variable:" In [22]: %pwd Out [22]: '/home/west/code/pydata-book/ In [23]: foo = %pwd. First, a single quote is missing from Out[22]. Second, with IPython 6.3.1, although In [22] works using pwd without the leading percent sign, In [23] without it fails with "NameError: name 'pwd' is not defined". Note from the Author or Editor: |
Gregory Sherman | Apr 13, 2018 | Sep 21, 2018 |
Printed | Page 29 first sentence |
I previously reported this issue, but it's a problem beyond the typo that was addressed in the reply. "Magic functions can be used by default without the percent sign..." This is not completely true. For example, this variation of In[23] will not work: foo = pwd The % in front of the magic command can be skipped (by default) if the command is the first "word" on an IPython line. I have found that leading whitespace is not a problem. Note from the Author or Editor: |
Gregory Sherman | Apr 23, 2019 | |
Printed | Page 30 Figure 2-6 |
Running the matplotlib code exactly as printed inside Figure 2-6 gives a Type error: TypeError: float() argument must be a string or a number, not 'builtin_function_or_method' Note from the Author or Editor: |
James Shenton | Apr 15, 2020 | |
Page 37-38 Table 2-3 |
Table 2-3. Binary operators Missing the modulo (%) operator. Note from the Author or Editor: |
Ali Tobah | Sep 02, 2020 | ||
Page 38 table 2-3 |
inconsistent description of a <= b, a < b (compared to the next line), should be for a < b, a <= b Note from the Author or Editor: |
B. Goas | Feb 13, 2019 | ||
Page 38 First paragraph of "Mutable and immutable objects" |
Text says "modifiedK", should be "modified:" |
David Welden | Sep 17, 2017 | Sep 25, 2017 | |
Printed | Page 39 Third paragraph under "Numeric Types" |
"Integer division not resulting in a whole number will always yield a floating-point number." Actually, this is true for whole numbers too: In [1]: 4/2 Out [1] : 2.0 Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
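The point about division can be checked directly in Python 3, where `/` always returns a float and `//` performs floor division:

```python
true_div = 4 / 2      # float, even though the result is a whole number
floor_div = 4 // 2    # int, because both operands are ints

fractional = 3 / 2    # 1.5
floored = 3 // 2      # 1
```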
Printed | Page 44 Last paragraph of "None" section |
"but also a unique instance of NoneType" should be "but also the unique instance of NoneType". Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 46 Code blocks in 2nd and 3rd paragraphs |
Two illegal print statements: print('It's negative') Since the strings contain single quote characters, they should be delimited by double quotes. Note from the Author or Editor: |
Michael Clark | Nov 05, 2017 | Sep 21, 2018 |
Printed | Page 46 Table 2-5 |
2012-4-18 should be 2012-04-18 Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 47 2 |
2nd edition: under the "for loops" section, 1st line, "iterater" --> "iterator" (otherwise not consistent with pg 50 where 'iterator' is also mentioned). Note from the Author or Editor: |
E G | Mar 08, 2020 | |
PDF, ePub | Page 51 2nd sentence below 'pass' heading |
" ... to be taken (or as a placeholder for code not yet implemetned); ..." "implemented" is spelled incorrectly~ |
Greg Graham | Jun 06, 2017 | Sep 25, 2017 |
Printed | Page 56 First sentence |
Sentence should read: "...which locates the first such value and removes it from the list..." Note from the Author or Editor: |
Thomas Koundakjian | Nov 09, 2017 | Sep 21, 2018 |
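A short sketch of the behavior the corrected sentence describes: `list.remove` deletes only the first matching value. The list contents are an assumption for illustration.

```python
b_list = ["foo", "red", "baz", "dwarf", "foo"]

# Only the first "foo" is removed; the later one stays in place.
b_list.remove("foo")
```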
Page 60 4th line |
Although it's never explicitly described as such, the output of the dictionary on this page is showing the dictionary as unordered. This would be incorrect, as the book is utilizing Python 3.6, the version in which dictionaries changed to insertion-ordered. Note from the Author or Editor: |
David Bankson | Jun 04, 2020 | ||
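A minimal sketch of insertion-ordered dictionaries. In CPython 3.6 this ordering is an implementation detail, and from Python 3.7 onward it is guaranteed by the language. The keys used here are assumptions for illustration.

```python
d = {}
d["b"] = [1, 2, 3, 4]
d["a"] = "some value"
d[7] = "an integer"

# Iteration follows insertion order, not sorted order.
keys_in_order = list(d)

# Deleting and re-inserting a key moves it to the end.
del d["b"]
d["b"] = "back again"
keys_after_reinsert = list(d)
```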
Printed | Page 61 Top |
In the example demonstrating zip, you use the names of three pitchers: Nolan Ryan, Roger Clemens, and Curt Schilling. In the example, you use zip to show first names and last names; the first_names has ('Nolan', 'Roger', 'Schilling') and last_names has ('Ryan', 'Clemens', 'Curt') Curt is his first name and Schilling is his last name, so Curt should be in first_names and Schilling in last_names. Note from the Author or Editor: |
Jon Ernster | Nov 28, 2017 | Sep 21, 2018 |
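With the names in the corrected order, "unzipping" the list of pairs gives the first and last names the entry describes:

```python
pitchers = [("Nolan", "Ryan"), ("Roger", "Clemens"), ("Curt", "Schilling")]

# zip(*pairs) transposes a list of (first, last) tuples into two tuples.
first_names, last_names = zip(*pitchers)
```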
Page 65 2nd |
Hello my friend. This is related to usage and definition of a set with reference to the 'set' function in python. While you may or may not agree whether this is a minor technical mistake, it is a mistake in terms of accuracy/precision. While a set is an unordered collection of unique elements, set as defined in python seems to be a 'sorted unordered collection of unique elements.' Thus, depending on input for set(), the actual output would vary under those definitions -- subtle as they might be. Note from the Author or Editor: |
E G | Mar 10, 2020 | ||
Printed | Page 66 Table 3-1. Python set operations |
The alternative syntax for a.issubset (b) and a.issuperset (b) should be "<=" and ">=" respectively (not N/A). Note from the Author or Editor: |
Daniel Andersson | Feb 03, 2018 | Sep 21, 2018 |
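The operator forms of the two set methods can be sketched as follows. The set contents are assumptions for illustration.

```python
a = {1, 2, 3}
b = {1, 2, 3, 4, 5}

subset_method = a.issubset(b)
subset_operator = a <= b        # same test as a.issubset(b)

superset_method = b.issuperset(a)
superset_operator = b >= a      # same test as b.issuperset(a)
```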
Printed | Page 66 Last paragraph. |
"Like dicts, set elements generally must be immutable." should be "Like dict keys, set elements generally must be immutable." Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
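The distinction between dict keys and dict values can be illustrated with a small sketch. The helper name `is_hashable` is hypothetical.

```python
# A mutable list is fine as a dict *value* ...
d = {"a": [1, 2, 3]}
d["a"].append(4)


def is_hashable(obj):
    """Hypothetical helper: True if obj can serve as a dict key or set element."""
    try:
        hash(obj)
        return True
    except TypeError:
        return False


# ... but dict keys and set elements must be hashable.
tuple_ok = is_hashable((1, 2))
list_ok = is_hashable([1, 2])
nested_ok = is_hashable((1, [2, 3]))  # a tuple containing a list is not hashable
```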
Printed | Page 70 bottom of page |
Suppose instead we had declared a as follows: a = [] def func(): for i in range(5): a.append(i) ======================================== The sentence implies that an explanation of what happens will follow, but there is none. Note from the Author or Editor: |
Gregory Sherman | Apr 14, 2018 | Sep 21, 2018 |
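A sketch of what happens in the case the sentence sets up: because `func` only mutates `a` and never rebinds the name, the list defined outside the function is modified.

```python
a = []


def func():
    # No assignment to `a` occurs here, so the name resolves to the
    # enclosing (module-level) list, which is mutated in place.
    for i in range(5):
        a.append(i)


func()
after_first_call = list(a)

func()
after_second_call = list(a)
```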
Printed | Page 80 Top line |
"As you will see later in the chapter, you can step into the stack (using the %debug or %pdb magics)..." should start with "As you will see in Appendix B..." Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 82 bottom of page; first entry of Table 3-4 |
Method Description read([size]) Return data from a string, with optional size argument indicating the number of bytes to read ============================================================= The "number of bytes" assertion is contradicted on the next page - "Python reads enough bytes ... to decode that many characters" and in the read() docstring - "Read at most n characters from stream." Note from the Author or Editor: |
Gregory Sherman | Apr 15, 2018 | Sep 21, 2018 |
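The characters-versus-bytes distinction can be demonstrated with a temporary file. The file name and sample text are assumptions for illustration.

```python
import os
import tempfile

text = "Sueña el rico"  # "ñ" occupies two bytes in UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Text mode: read(4) returns four decoded characters.
    with open(path, encoding="utf-8") as f:
        chars = f.read(4)

    # Binary mode: read(4) returns four raw bytes, splitting "ñ" in half.
    with open(path, "rb") as f:
        raw = f.read(4)
```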
Printed | Page 100 Warning box mid page |
The warning claims boolean selection will not fail if the boolean array is not the correct length. I think this was changed in Numpy 1.13, but is definitely not true in Numpy 1.14.2 For example: x = np.random.randn(5,5) y = np.array(['a','b','c', 'a', 'b', 'c', 'd', 'd', 'd']) x[y == 'a'] IndexError: boolean index did not match indexed array along dimension 0; dimension is 5 but corresponding boolean dimension is 9 Note from the Author or Editor: |
Mladen Kolovic | Apr 08, 2018 | Sep 21, 2018 |
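The mismatch can be reproduced with a minimal sketch, assuming a recent NumPy release in which a boolean index of the wrong length raises `IndexError`:

```python
import numpy as np

x = np.zeros((5, 5))
y = np.array(["a", "b", "c", "a", "b", "c", "d", "d", "d"])

mask = y == "a"  # length 9, but x has only 5 rows

try:
    x[mask]
    raised = False
except IndexError:
    raised = True
```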
Printed | Page 103 3rd paragraph |
1st release print copy says, “...the result of fancy indexing is always one-dimensional.” However, there are example outputs in this section with more than one dimension. Is that because some of the examples in the section are not fancy indexing? If that’s the case, it’s unclear where the section is building up to a fancy indexing example as opposed to every example being fancy indexing. The number of dimensions in the output seems to be the number of array dimensions plus one minus the number of dimensions indexed. Note from the Author or Editor: |
Stephen Frost | Feb 19, 2018 | Sep 21, 2018 |
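The dimensionality question can be explored with a small sketch. The array is an assumption for illustration. Indexing one axis with a list keeps the other axis, while passing one index list per axis selects individual elements and yields a one-dimensional result.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# One index list: selects whole rows, so the result stays two-dimensional.
rows = arr[[1, 5, 7, 2]]

# Two index lists: selects elements (1, 0), (5, 3), (7, 1), (2, 2).
elements = arr[[1, 5, 7, 2], [0, 3, 1, 2]]
```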
Page 108 Table 4-3. Unary ufuncs |
Missing term: "Natural logarithm (base e), log base 10, log base 2, and , respectively". |
A. Jesse Jiryu Davis | May 23, 2017 | Sep 25, 2017 | |
Printed | Page 112,121 first sentence of 112. "Simulating ..." on 121 |
pg 112 first sentence: Here, arr.mean(1) means "compute mean across the columns" where arr.sum(0) means "compute sum down the rows" conflicts with pg. 121 "Simulating Many Random Walks at Once": "we can compute the cumulative sum across the rows" . . . In [262]: walks = steps.cumsum(1) Note from the Author or Editor: |
Gregory Sherman | Jan 04, 2019 | |
Printed | Page 114 1 |
May want to specify that arr.mean(1) is the same as arr.mean(axis=1). The fewer assumptions the reader has to make, the better. Note from the Author or Editor: |
Shivan Sivakumaran | Oct 03, 2020 | |
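The axis conventions raised in the two entries above can be made concrete with a small array, chosen here as an assumption for illustration. The positional argument and the `axis` keyword are the same parameter.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# axis=1 collapses the columns: one result per row.
row_means = arr.mean(1)
row_means_kw = arr.mean(axis=1)

# axis=0 collapses the rows: one result per column.
col_sums = arr.sum(0)

# cumsum(1) accumulates along each row, moving across the columns.
running = arr.cumsum(1)
```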
Page 123 1st Paragraph |
"operations" is misspelled at the location, "Using NumPy functions or NumPy-like oeprations..." |
Ryan Shuhart | Jan 05, 2017 | Sep 25, 2017 | |
Printed | Page 126 first paragraph |
The book states when you are only passing a dict, the index in the resulting Series will have the dict's key in sorted order. However, this is not always the case. Running the code on my system I have the output pasted below. Looking at the output we see that returned series is not in sorted order. sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000} obj3=pd.Series(sdata) obj3 Out[49]: Ohio 35000 Texas 71000 Oregon 16000 Utah 5000 dtype: int64 Note from the Author or Editor: |
Howard Smith | Aug 30, 2018 | |
Printed | Page 126 2nd paragraph |
"You can override this by passing the dict keys in the order you want them to appear in the resulting Series" However, given the [29]- [31] commands, the actual result is Out[31]: Oregon 16000.0 California NaN Texas 71000.0 Ohio 35000.0 Note from the Author or Editor: |
Gregory Sherman | Jan 04, 2019 | |
Printed | Page 128 Ch5.1: Introduction to pandas Data Structures: Series - 8th Para |
The text says "When you are only passing a dict, the resulting Series will have the dict's keys in sorted order". This doesn't appear to be true, either with the example given in the book, or with a repro (which proves the example is not the error). These keys seem _un_sorted to me when only passing in a dict. >>> import pandas as pd >>> sdata = { 'Ohio':35000, 'Texas':71000, 'Oregon':16000,'Utah': 5000 } >>> obj3 = pd.Series(sdata) >>> obj3 Ohio 35000 Texas 71000 Oregon 16000 Utah 5000 dtype: int64 >>> obj3.index Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object') Note from the Author or Editor: |
Gavin Draper | Mar 16, 2021 | |
Printed | Page 138 In[104] |
As presented, this line leads to "FutureWarning: Passing list-likes to .loc with missing label will raise KeyError in future." Revise to avoid warning. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 141 Sentence beginning with word Setting in italics |
Word "section" is misspelled as "sectino" |
Anonymous | Sep 20, 2017 | Sep 25, 2017 | |
Page 145 last paragraph |
"To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented." I think this should be integer-oriented. Note from the Author or Editor: |
Yang Yang | Oct 18, 2017 | Sep 21, 2018 | |
Printed | Page 145, 146 final paragraph & code following |
[similar to a previously reported issue] "... if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) ..." In [147]: ser[:1] Out[147]: 0 0.0 dtype: float64 In[148]: ser.loc[:1] Out[148]: 0 0.0 1 1.0 dtype: float64 In[149]: ser.iloc[:1] Out[149]: 0 0.0 dtype: float64 The series "ser" is indexed by integers, so - according to the text - data selection should be label-oriented (in the absence of loc or iloc). However, Out[147] is identical to Out[149], which results from using iloc, so the "ser[:1]" data selection appears to be integer-oriented. Note from the Author or Editor: |
Gregory Sherman | Apr 30, 2019 | |
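The explicit indexers avoid the ambiguity discussed in this entry. In this minimal sketch, `loc` slices by label and includes the endpoint, while `iloc` slices by position and excludes it.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # integer index 0, 1, 2

by_label = ser.loc[:1]      # labels up to and including 1
by_position = ser.iloc[:1]  # positions up to but excluding 1
```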
Printed | Page 158 In [233] |
Row 'c' is populated with 0.0, not NaN. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 160 Table 5-8 |
The text says: "argmin, argmax - Compute index locations (integers) at which minimum or maximum value obtained, respectively" Should be: "argmin, argmax - Compute index labels for Series at which minimum or maximum value obtained, respectively" Example from this chapter, which returns a label, not an integer: In [115]: df.loc['d'].argmin() Out[115]: 'two' Note from the Author or Editor: |
Andrey Dubinchak | Dec 14, 2017 | Sep 21, 2018 | |
Page 164 table 5-9 |
Looks like instead of method "match" there should be "get_indexer" Note from the Author or Editor: |
Aivar Annamaa | Nov 18, 2017 | Sep 21, 2018 | |
Printed | Page 172 Table 6-2 |
For the argument "names", combining with "header=None" is not needed. Using the parameter "names" implies this. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 173 in Table 6-2 |
The 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012. Note from the Author or Editor: |
Noritada Kobayashi | Nov 05, 2017 | Sep 21, 2018 | |
Page 174 after Out[38] |
As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012. Note from the Author or Editor: |
Noritada Kobayashi | Nov 05, 2017 | Sep 21, 2018 | |
Page 175 top |
As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012. Note from the Author or Editor: |
Noritada Kobayashi | Nov 05, 2017 | Sep 21, 2018 | |
Printed | Page 176 Bottom of page |
"tuples of values" should be "lists of values". Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 179 Out[64]: result |
In [64]: result Out[64]: {'name': 'Wes', 'pet': None, 'places_lived': ['United States', 'Spain', 'Germany'], 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']}, {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]} 'pet': None, Should print after the line: 'places_lived': ['United States', 'Spain', 'Germany'], Note from the Author or Editor: |
Shaahin Riazi | Apr 18, 2020 | ||
Printed | Page 180 1st paragraph |
Refers to the USDA Food Database example in Chapter 7; in second edition, this example is in Chapter 14.4 (page 436-442) Note from the Author or Editor: |
Laura Hughes | Jan 31, 2018 | Sep 21, 2018 |
Page 182 first block of code for getroot |
Code says the example file is in path: path='examples/mta_perf/Performance_MNR.xml' Actual path from git repository is: path='datasets/mta_perf/Performance_MNR.xml' Note from the Author or Editor: |
David Welden | Sep 25, 2017 | Oct 20, 2017 | |
PDF, ePub | Page 184 Link to Apache Arrow in 'Feather' Section |
URL for 'Apache Arrow' points to 'apache.arrow.org' instead of 'arrow.apache.org' |
Joel A | Oct 08, 2017 | Oct 20, 2017 |
Printed | Page 184 last paragraph & [92] |
In[92]: frame = pd.DataFrame({'a': np.random.randn(100)}) fails: ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing ------------------------- Although PyTables is mentioned in the previous text, there is no indication that this library needs to be installed. I tried "pip install PyTables", but it failed with: Collecting PyTables Could not find a version that satisfies the requirement PyTables (from versions: ) No matching distribution found for PyTables So, I'm still without PyTables and don't know how to get it. Note from the Author or Editor: |
Gregory Sherman | Jan 09, 2019 | |
Printed | Page 185 In [96] |
The command "store" at this point produces only the first two lines of the indicated output - the rest is not produced. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 185 First sentence of final paragraph |
Text reads "...how they can sunit your needs" Should be "...how they can suit your needs" Note from the Author or Editor: |
David Welden | Sep 25, 2017 | Sep 25, 2017 | |
Printed | Page 186-187 Under heading on 186, second code block on 187 |
This is not so much an error, per se, but a comment on a "may" clause in the book. I'm writing this in case you like to track these sorts of issues. On page 186, the text says "Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these manually with pip or conda." This is very true as the example line on the next page (187) "writer = pd.ExcelWriter('examples/ex2.xlsx')" threw an error on my system. I'm using pandas 0.21.0 within a python 3.6.2 virtual environment. Manually installing the packages in question via pip solved my problems. Thanks! Note from the Author or Editor: |
Jim Sam | Dec 08, 2017 | Sep 21, 2018 |
Printed | Page 186 [105] |
The text and command conflict: "Data stored in a sheet can then be read into DataFrame with parse: In [105]: pd.read_excel(xlsx, 'Sheet1')" Note from the Author or Editor: |
Gregory Sherman | Jan 09, 2019 | |
Page 192 Table 7-1 |
The left column is named as "Argument", which should be "Method". Note from the Author or Editor: |
Noritada Kobayashi | Nov 25, 2017 | Sep 21, 2018 | |
Printed | Page 195 In [35] |
In [35]: _ = df.fillna(0, inplace=True) The assignment "_ =" is unnecessary. Note from the Author or Editor: |
Gregory Sherman | May 11, 2019 | |
Printed | Page 204 middle |
In [85]: data = np.random.randn(20) In [86]: pd.cut(data, 4, precision = 2) . . . The precision = 2 option limits the decimal precision to two digits. ----------- However, one of the bins I get is (0.031, 0.27] Note from the Author or Editor: |
Gregory Sherman | Jan 10, 2019 | |
Printed | Page 206 last sentence |
"Calling permutation with the length of the axis you want to permute ..." According to what I have seen (an example is below) it seems that the phrase should be "the length of axis 0" or "the number of rows". Calling permutation() with the number of columns can result in rows being dropped or an IndexError. The question arises: can permutation() or a similar function randomly order columns? In [225]: df=DataFrame(np.arange(12).reshape((4,3))) In [226]: df Out[226]: 0 1 2 0 0 1 2 1 3 4 5 2 6 7 8 3 9 10 11 In [227]: s=np.random.permutation(4) In [228]: df.take(s) Out[228]: 0 1 2 2 6 7 8 1 3 4 5 0 0 1 2 3 9 10 11 In [229]: s=np.random.permutation(3) In [230]: df.take(s) Out[230]: 0 1 2 1 3 4 5 0 0 1 2 2 6 7 8 . . . In [258]: df=DataFrame(np.arange(12).reshape((3,4))) In [259]: s=np.random.permutation(4) In [260]: df.take(s) . . . IndexError: indices are out-of-bounds Note from the Author or Editor: |
Gregory Sherman | Jan 10, 2019 | |
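On the question of reordering columns: one possible approach, sketched here under the assumption that `DataFrame.take` is called with `axis=1`, is to build the permutation from the number of columns and apply it along that axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)))

# Permute rows: the permutation length must match axis 0.
row_order = np.random.permutation(len(df))
shuffled_rows = df.take(row_order)

# Permute columns: the permutation length must match axis 1.
col_order = np.random.permutation(df.shape[1])
shuffled_cols = df.take(col_order, axis=1)
```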
Page 208 1st paragraph of a section named "Computing Indicator/Dummy Variables" |
The paragraph says "Let’s return to an earlier example DataFrame". However, since that example is contained in section 8.2 in the 2nd edition, "earlier" is not an appropriate word. Note from the Author or Editor: |
Noritada Kobayashi | Nov 27, 2017 | Sep 21, 2018 | |
Printed | Page 209 In [115] |
The parameter "engine='python'" is needed in this command. Without this, a ParserWarning is produced due to the two character separator. |
John Boersma | Nov 11, 2018 | |
Page 213 Table 7-3 |
The left column is named as "Argument", which should be "Method". Note from the Author or Editor: |
Noritada Kobayashi | Nov 25, 2017 | Sep 21, 2018 | |
Printed | Page 213 Table 7-3 |
The method "strip" is described as "equivalent to x.strip()". Isn't it exactly the same thing, not just equivalent? |
John Boersma | Nov 11, 2018 | |
Page 217 bottom |
It's really not clear what In [176]: matches.str.get(1) is supposed to be returning here. Similarly with In [177]: matches.str[0] and matches.str[0]. I would expect to be shown a method to retrieve the regex matched groups for each email address string, but this clearly isn't what happens with this syntax. Was something else meant? Note from the Author or Editor: |
Anonymous | Mar 09, 2018 | Sep 21, 2018 | |
Page 219 Table 7- 5 |
The book says: "match - Use re.match with the passed regular expression on each element, returning matched groups as list" Should say: "... returning Series/array of boolean values" And commands on pp 217 - 218 are not correct, because they return boolean values and there is no "access elements" at all. Instead of: In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE) In [175]: matches Out[175]: Dave True Rob True Steve True Wes NaN dtype: object In [176]: matches.str.get(1) Out[176]: Dave NaN Rob NaN Steve NaN Wes NaN dtype: float64 In [177]: matches.str[0] Out[177]: Dave NaN Rob NaN Steve NaN Wes NaN dtype: float64 it may be better to use: In [174]: matches = data.str.extract(pattern, flags=re.IGNORECASE) In [175]: matches Out[175]: 0 1 2 Dave dave google com Rob rob gmail com Steve steve gmail com Wes NaN NaN NaN In [176]: matches[0] Out[176]: Dave dave Rob rob Steve steve Wes NaN Name: 0, dtype: object In [177]: matches.iloc[:, 0] Out[177]: Dave dave Rob rob Steve steve Wes NaN Name: 0, dtype: object Note from the Author or Editor: |
Andrey Dubinchak | Dec 27, 2017 | Sep 21, 2018 | |
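The difference between the two methods can be sketched as follows, assuming a pandas version in which `str.match` returns booleans and `str.extract` accepts `expand=True`. The data and pattern mirror the ones quoted in the entry.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series(
    {
        "Dave": "dave@google.com",
        "Steve": "steve@gmail.com",
        "Rob": "rob@gmail.com",
        "Wes": np.nan,
    }
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# str.match only reports whether each element matches the pattern.
matched = data.str.match(pattern, flags=re.IGNORECASE)

# str.extract returns one column per capture group.
extracted = data.str.extract(pattern, flags=re.IGNORECASE, expand=True)
```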
Printed | Page 219 7.4 Conclusion |
"Effective data preparation can significantly improve productive by ..." should read "Effective data preparation can significantly improve productivity by ..." Note from the Author or Editor: |
Francis Lewis | Jan 10, 2018 | Sep 21, 2018 |
Printed | Page 224 The line before the section "Reordering and Sorting Levels" |
The code MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color']) should be pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color']) Note from the Author or Editor: |
Klaus Wang | May 17, 2018 | Sep 21, 2018 |
Page 229 Table 8-1 Different join types with how argument |
Final entry in table is 'output' join. It should be 'outer' join. Note from the Author or Editor: |
David Welden | Sep 26, 2017 | Oct 20, 2017 | |
Printed | Page 237 In [86] |
The command as it stands produces a FutureWarning. Either sort=True or sort=False should be added as parameters. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 241 Final paragraph |
The example of Series method combine_first is a bit vague. Although it apparently produces the desired output, the choice of b[:-2] and a[2:] for arguments is not obvious. It appears that it was chosen in order to reorder the index as well as combining data values, but this is not explained. Note from the Author or Editor: |
David Welden | Sep 27, 2017 | Oct 20, 2017 | |
Printed | Page 242 code examples with combine_first |
The operation at the bottom of page 241: In [112]: np.where(pd.isnull(a), b, a) will take elements from a where available and from b where not available in a. The analogous operation using combine_first should then probably be: a.combine_first(b) rather than: b.combine_first(a) Note from the Author or Editor: |
Artem Glebov | Dec 26, 2018 | |
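The ordering argument can be checked with a small sketch. The two Series are assumptions that share one index, so that the positional `np.where` and the label-aligned `combine_first` are directly comparable.

```python
import numpy as np
import pandas as pd

index = ["x", "y", "z"]
a = pd.Series([np.nan, 2.5, np.nan], index=index)
b = pd.Series([1.0, 9.0, 3.0], index=index)

# Take values from a where present, otherwise fall back to b.
via_where = np.where(pd.isnull(a), b, a)

# Same preference order: a first, patched with b.
a_first = a.combine_first(b)

# Reversed preference: b first, patched with a (b has no gaps here).
b_first = b.combine_first(a)
```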
Printed | Page 242 Third example on the page "In [93]:" |
The example uses the optional argument "join_axes". As of 4/5/21, this argument has been deprecated and removed, and it now results in a TypeError. It can be replaced with the reindex function. Note from the Author or Editor: |
Dennis L Gonzales | Apr 06, 2021 | |
Page 244 After Out[131]: |
In [132] and Out[132] repeat In [131] and Out[131], so In [132] and Out[132] should be removed. Note from the Author or Editor: |
Shaahin Riazi | Oct 08, 2020 | ||
Printed | Page 255 In [18] |
As it stands, this line produces a "MatplotlibDeprecationWarning: In future re-calling will create a new instance." Best to revise to avoid a warning. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 255 explanation of [17] |
"In IPython, an empty plot window will appear" No window appeared in 7.0.1 after running [11], "%matplotlib", [16], [17], [18] and [19] Note from the Author or Editor: |
Gregory Sherman | Jan 14, 2019 | |
Printed | Page 259 Middle |
In: subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None) None does not adjust. Use 0. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 274 after Figure 9-17 |
Although mentioned that "tipping dataset used earlier in the book", the tipping dataset does not seem to be used earlier. That dataset is used first in this section and later in Ch. 10. Note from the Author or Editor: |
Noritada Kobayashi | Nov 11, 2017 | Sep 21, 2018 | |
Page 279 After Figure 9-22. |
"distplot" method has been deprecated and removed in newer versions. Note from the Author or Editor: |
Shaahin Riazi | Oct 22, 2020 | ||
Page 283 In [108]: And In[109]: |
The `factorplot` function has been renamed to `catplot`. Note from the Author or Editor: |
Shaahin Riazi | Oct 22, 2020 | ||
Page 300 In [66]: |
result = grouped['tip_pct', 'total_bill'].agg(functions) needs an extra pair of []. Correct: result = grouped[['tip_pct', 'total_bill']].agg(functions) Note from the Author or Editor: |
Shaahin Riazi | Oct 30, 2020 | ||
Printed | Page 301 2nd Paragraph |
"indepedently" should be "independently". Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 311 Top code block |
"for suit in ['H','S','C','D']: " should be "for suit in suits:". Otherwise, there is no point in defining "suits" earlier in the code block. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 335 ts.shift(1, freq='90T') example |
This method with 90T parameter should lag the data by 90 minutes at 90 min frequency. Instead, it seems to preserve the monthly frequency and only lag every timestamp by 1:30hr. Am I reading this correctly or is this by design? Clarification would be helpful. Note from the Author or Editor: |
Serge | Jan 25, 2018 | Sep 21, 2018 |
Printed | Page 339 First whole paragraph |
"EST" should be "Eastern Time". The point is that the interval straddles the standard time - daylight savings time boundary. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 340 The source code that shows Timestamp arithmetic before the DST transition |
The source code that demonstrates arithmetic before the DST transition uses '2012-3-12 01:30', tz='US/Eastern'. However, in 2012 DST in US/Eastern started on 2012-3-11, so the code as printed does not cross the DST transition, which may confuse readers. The first edition of this book used '2012-03-11', not '2012-03-12', and was correct. Note from the Author or Editor: |
Masato Setoyama | Mar 02, 2018 | Sep 21, 2018 |
Printed | Page 347 [197] - [199] |
"To convert back to timestamps, use to_timestamp:" There is no apparent change to the Series 'ts' by [197] & [199] - what is being demonstrated? Note from the Author or Editor: |
Gregory Sherman | Jan 19, 2019 | |
Printed | Page 351 Table 11-5, last row |
convention defaults to 'start', not 'end'. Note from the Author or Editor: |
Hengni Cai | Mar 29, 2018 | Sep 21, 2018 |
Printed | Page 352 1st and 2nd code examples. |
The 2 code examples are the same. In[216]: ts.resample('5min', closed='right').sum() In[217]: ts.resample('5min', closed='right').sum() 216 should be WITHOUT the `closed='right'` Note from the Author or Editor: |
Charbel Sarkis | Sep 27, 2018 | |
Printed, PDF, ePub | Page 358 Figure 11-5 |
Fig 11-5 caption says: Apple 250-day daily return standard deviation. However, the calculation is based on price, so it is the price standard deviation, which is not what one usually looks at. The correct call to plot the return standard deviation (add pct_change()) would be, e.g.: close_px.AAPL.pct_change().rolling(252, min_periods=int(252/2)).std().plot() Standard practice in finance is to show the annualized volatility, which would be: (close_px.AAPL.pct_change().rolling(252, min_periods=int(252/2)).std()*np.sqrt(252)).plot() Note from the Author or Editor: |
Anonymous | May 23, 2019 | |
Printed | Page 370 In [50] |
The use of outer parentheses to facilitate line breaks, which is explained on page 381, should really be explained here at the first use. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 378 Text |
The meaning of "unwrapped" here is really unclear. Does this refer to an internal process? The example is the same as on page 376, where "unwrapped" is not mentioned. Also, is "fast past" correct? Not sure what this means. Should it be "fast pass" or "fast path"? Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 384 1st Paragraph |
The class 'pandas.TimeGrouper' no longer exists. It has been replaced by 'pandas.Grouper'. The code should be changed to the following: time_key = pd.Grouper(freq='5min') Note from the Author or Editor: |
Ben B | Sep 17, 2020 | |
Printed | Page 390 In [38] |
Need parameter "rcond=None" to suppress FutureWarning. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 437 Between 1st paragraph and 2nd paragraph |
After the last sentence "Then, these can be concatenated together with concat:", some Python code appears to be missing. The code can be found in https://github.com/wesm/pydata-book/blob/2nd-edition/ch14.ipynb and is reproduced below: nutrients = [] for rec in db: fnuts = pd.DataFrame(rec['nutrients']) fnuts['id'] = rec['id'] nutrients.append(fnuts) nutrients = pd.concat(nutrients, ignore_index=True) Note from the Author or Editor: |
Haruyoshi TAKIGUCHI | Apr 03, 2018 | Sep 21, 2018 | |
Page 452 1st paragraph |
The paragraph states that "the result is shown in Figure A-3", but Figure A-3 is "illustration", not "result" (just a cosmetic issue). Note from the Author or Editor: |
Noritada Kobayashi | Nov 26, 2017 | Sep 21, 2018 | |
Page 467 center of the page |
The paragraph states that "the output of outer will have a dimension that is the sum of the dimensions of the inputs". Since the result of outer for (3, 4) and (5,) is (3, 4, 5), would it be better to replace the word "sum" with "concatenation"? Note from the Author or Editor: |
Noritada Kobayashi | Nov 26, 2017 | Sep 21, 2018 | |
Page 473 Code example 188 |
It would be better to make a zipped result more pretty for the last code example as follows: In [188]: zip(last_name[sorter], first_name[sorter]) Out[188]: <zip at 0x7fa203eda1c8> Note from the Author or Editor: |
Noritada Kobayashi | Nov 27, 2017 | Sep 21, 2018 | |
Printed | Page 479 [214] through [215] |
In [214]: numba_mean_distance = nb.jit(mean_distance) We could also have written this as a decorator: @nb.jit def mean_distance(x, y): . . . In [215]: %timeit numba_mean_distance(x, y) To be consistent, I would make the definition begin with "def numba_mean_distance(x, y):" Note from the Author or Editor: |
Gregory Sherman | Feb 01, 2019 | |
Printed | Page 482 Top |
"mmap" is a fairly large file on disk. It would be good to add a command to delete it when done here. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 483 [230] and [231], plus preceding text |
"In this example, summing the rows of these arrays should, in theory, be faster for arr_c than arr_f ..." Runs on my Windows 10 PC consistently show the opposite, like: In [46]: %timeit arr_c.sum(1) 1.65 ms ± 9.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [47]: %timeit arr_f.sum(1) 994 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) I have carefully checked: "C_CONTIGUOUS : True" for arr_c and "F_CONTIGUOUS : True" for arr_f Any idea what's going on? Note from the Author or Editor: |
Gregory Sherman | Jan 30, 2019 | |
Printed | Page 483 preceding text and [230] and [231] |
[more on same issue] On my PC, I found that sum(0) runs faster on arr_c : In [17]: %timeit arr_c.sum(0) 953 µs ± 9.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [18]: %timeit arr_f.sum(0) 1.6 ms ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) I wonder if the output in [230] and [231] does not actually result from what was built in [225] and [226]. Note from the Author or Editor: |
Gregory Sherman | Feb 02, 2019 | |
Page 485 The first paragraph and code |
Original: Since the input variables are strings they can be executed again with the Python exec keyword: In [30]: exec(_i27) I propose the following: Since the input variables are strings they can be evaluated again with the Python eval keyword: In [30]: eval(_i27) Out[30]: 'bar' It seems that "exec" does not make sense in this context because _i27 is not a statement. Note from the Author or Editor: |
Haruyoshi TAKIGUCHI | Apr 28, 2018 | Sep 21, 2018 | |
Printed | Page 487 Scorpion comment |
The comment that deleting a variable does not free up memory appears to be incorrect. After using del I had a decrease in memory used on my mac as shown on activity monitor. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Printed | Page 491 Middle |
"works_fine" method should be "works_fine function". Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Page 494 The first code quote in the section "Basic Profiling: %prun and %run -p" |
Found two errors under Python 3. 1) for _ in xrange(niter): needs to be replaced with something like for _ in range(niter): 2) print 'Largest one we saw: %s' % np.max(some_results) needs to be replaced with something like print('Largest one we saw: {0}'.format(np.max(some_results))) Note from the Author or Editor: |
Haruyoshi TAKIGUCHI | Apr 08, 2018 | Sep 21, 2018 | |
Printed | Page 495 In [561] and In [562] |
Reported Wall times are way off. More like 250ms and 100ms. Note from the Author or Editor: |
John Boersma | Nov 11, 2018 | |
Other Digital Version | 2255 Functions Are Objects (section) |
In the Amazon Kindle version, Chapter 3, section "Functions Are Objects", the text explains that the code: import re def clean_strings(strings): result = [] for value in strings: value = value.strip() value = re.sub('[!#?]', '', value) value = value.title() result.append(value) return result should clean the data FROM: states = [ ' Alabama ', 'Georgia!' , 'Georgia', 'georgia', 'Fl0rida', ... ] TO: ['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', ... ] When running the code, none of the methods change 'Fl0rida' to 'Florida' as mentioned in the text. All the other entries are cleaned as described. Note from the Author or Editor: |
Kyle Jeffreys | May 16, 2020 | |
Mobi | Page 2621 |
"For large DataFrames, the head method is useful to get see the first 5 rows:" 'get' should be removed |
Bridgeland | Mar 29, 2017 | Sep 25, 2017 |
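
Two of the reports above (the "Functions Are Objects" entry and the Page 485 exec/eval entry) can be checked with the standard library alone. The following is a minimal sketch, not the author's code: the `states` values are copied from the report as submitted, and `foo` and `stored_input` are hypothetical stand-ins for the IPython session variable `_i27`, whose actual content is not shown on this page.

```python
import re


def clean_strings(strings):
    """Strip whitespace, drop '!', '#', '?' and title-case each value."""
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result


# Sample values as quoted in the "Functions Are Objects" report.
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'Fl0rida']
cleaned = clean_strings(states)

# None of the three steps touches the digit, so 'Florida' is never produced.
florida_was_produced = 'Florida' in cleaned

# Hypothetical stand-ins for the IPython input history (assumption):
# an earlier cell bound foo, and the stored input string is the bare name.
foo = 'bar'
stored_input = 'foo'

exec_result = exec(stored_input)  # runs the code, but exec always returns None
eval_result = eval(stored_input)  # evaluates the expression and returns its value

print(cleaned)
print(florida_was_produced, exec_result, eval_result)
```

Under these assumptions, `exec` runs the stored input without returning anything, while `eval` returns the value, which is the distinction the Page 485 report relies on.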