Web Scraping with Python

Errata for Web Scraping with Python


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious Technical Mistake | Minor Technical Mistake | Language or formatting error | Typo | Question | Note | Update

Version Location Description Submitted By Date Submitted
Safari Books Online Chapter 2
Table 2-1, the meaning section for $

Simply missing the letter 'h' in the word 'thought'. Text: "This can be thougt of as analogous to the ^ symbol."

Devin  Mar 12, 2017 
ePub Page 4

bsObj = BeautifulSoup(html) — Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 423-424). O'Reilly Media. Kindle Edition. The call to BeautifulSoup on Windows, using Python 3 from Anaconda3, can produce an error if the webpage contains a non-ASCII character. It can be fixed deep down in the Python code, but it would be better to warn the reader that, at least in this setting, you can get a character-encoding error.

Clifford Ireland  Jul 02, 2019 
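A minimal sketch of the point above, assuming the error is a UnicodeDecodeError raised while decoding the response bytes with the platform's default code page; the byte string and encoding here are illustrative, not from the book:

```python
# Stand-in for urlopen(url).read(); real pages may arrive in any encoding.
html_bytes = "r\u00e9sum\u00e9 caf\u00e9".encode("utf-8")

# Decoding explicitly as UTF-8 (instead of relying on the platform default)
# sidesteps the character error before the text ever reaches BeautifulSoup.
html = html_bytes.decode("utf-8")
```

With the text decoded up front, BeautifulSoup(html) receives a str and no longer depends on the platform's default encoding.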
PDF Page 10
2nd paragraph below the codes

The book says "If the server is not found at all (if, say, http://www.pythonscraping.com was down, or the URL was mistyped), urlopen returns a None object." But urlopen never actually returns a None object. In "The Python Standard Library" documentation, the introduction of urlopen says "Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens)." (https://docs.python.org/3/library/urllib.request.html#module-urllib.request). So, instead of returning a None object, it raises a URLError when the server is not found, as the documentation says: "Raises URLError on protocol errors." In my test, when the URL is mistyped intentionally, it raises "URLError: <urlopen error [Errno 11001] getaddrinfo failed>". This chapter introduces only HTTPError (a subclass of URLError) but omits URLError, which seems incomplete.

Anonymous  Jul 17, 2016 
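The behavior this erratum describes can be sketched as follows; the helper name fetch is illustrative, and both exception types can be handled separately because HTTPError subclasses URLError:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url):
    """Return the page bytes, or None if the request fails."""
    try:
        return urlopen(url).read()
    except HTTPError:
        # The server was reached but returned an error status (404, 500, ...).
        return None
    except URLError:
        # The server could not be reached at all: DNS failure, mistyped
        # host name, refused connection. urlopen raises; it never returns None.
        return None
```

For a mistyped host such as fetch("http://mistyped-host.invalid/"), None comes from the URLError branch, not from urlopen itself.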
PDF Page 28
under the section Lambda Expressions

quote: "BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to “true” are returned while the rest are discarded. soup.findAll(lambda tag: len(tag.attrs) == 2)" --- It says the functions passed into findAll() must return a boolean, but the example uses len(), which returns an int, not a boolean. I think you meant that the whole condition should evaluate to true, rather than that the inner function must return a boolean, but the phrasing could be clearer.

Anonymous  Jul 13, 2017 
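The distinction can be shown without BeautifulSoup at all; FakeTag below is a stand-in for the real tag object, and the filter only needs to return something truthy:

```python
# FakeTag is a stand-in for a BeautifulSoup tag object; only the .attrs
# dictionary matters for this illustration.
class FakeTag:
    def __init__(self, attrs):
        self.attrs = attrs

tags = [
    FakeTag({"class": "body", "id": "main"}),  # two attributes
    FakeTag({"class": "body"}),                # one attribute
]

# len() returns an int, but the comparison `== 2` yields a bool; findAll
# only requires the filter's result to be truthy, not literally a bool.
filter_fn = lambda tag: len(tag.attrs) == 2
matches = [tag for tag in tags if filter_fn(tag)]
```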
Printed Page 30
The section with code - the regex

I think that for a regex newbie like me, it would be nice if the regex were consistent. If the forward slashes don't need to be escaped, then why are they? I am just really confused about the forward slashes. Thanks for your help. What I posted in a regex course: I was reading a book about Python web scraping and then referenced the quick guide to regex, and it seems to me that if I want to find the following pattern: ../img/gifts/img1.jpg, ../img/gifts/img2.jpg, etc., the expression should really be '\.\.\/img\/gifts\/img.*\.jpg', right? Wondering if I am missing something. The response I got: "Sure, Ray. Looks good. You can further tighten the constraints (instead of the wildcard *) by explicitly looking for digits. You don't need to escape forward slashes: \.\./img/gifts/img\d{1,}\.jpg"

Anonymous  May 09, 2018 
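The responder's point is easy to verify with the two patterns from the exchange above; in Python regexes, escaping a forward slash is legal but unnecessary:

```python
import re

# The escaped spelling from the question and the tightened, unescaped
# spelling from the answer; both match the book's image paths.
escaped = re.compile(r'\.\.\/img\/gifts\/img.*\.jpg')
tightened = re.compile(r'\.\./img/gifts/img\d{1,}\.jpg')

path = '../img/gifts/img1.jpg'
both_match = bool(escaped.match(path)) and bool(tightened.match(path))
```

The \d{1,} version also rejects non-numeric names such as '../img/gifts/imgX.jpg', which the wildcard version accepts.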
PDF Page 32
Code example on the page

When the book instructs you to build a simple scraper to find Kevin Bacon's film history, it does not take into account that Wikipedia has blocked this type of crawler. The example on page 32 is impossible to follow along with because Wikipedia now requires SSL (HTTPS) access from the crawler.

Dockmann  Jun 01, 2017 
PDF Page 33
2nd bullet point

quote: " The URLs do not contain semicolons" should be: The URLs do not contain colons

Anonymous  Jul 20, 2017 
Printed, PDF Page 33
2nd paragraph

The second paragraph states: "The URLs do not contain semicolons". Line six of the following example code:
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
should be:
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!;).)*$")):

JR  Oct 04, 2017 
Printed Page 35
First Paragraph through code midway down

The issue about the colon vs. semicolon in the other errata is moot: colons do appear in valid links on the Kevin Bacon Wikipedia page. Examples include /wiki/Tremors_5:_Bloodline and /wiki/X-Men:_First_Class. Therefore the regular expression, which seeks to exclude all non-content links and include only content links, excludes at least two content links.

Anonymous  May 04, 2018 
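This erratum's point can be checked directly with the book's pattern; the first two example paths are from the submission, and /wiki/Category:Kevin_Bacon stands in for a typical non-content link:

```python
import re

# The href filter from page 33; the negative lookahead (?!:) rejects any
# path containing a colon after /wiki/.
pattern = re.compile("^(/wiki/)((?!:).)*$")

keeps_article = bool(pattern.match("/wiki/Kevin_Bacon"))          # kept
drops_category = not pattern.match("/wiki/Category:Kevin_Bacon")  # intended exclusion
drops_tremors = not pattern.match("/wiki/Tremors_5:_Bloodline")   # content link lost
```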
PDF Page 64
6th paragraph

If the pages are all similar (they all have basically the same types of content), you may want to add a pageType attribute to your existing web-page object: class Website: """Common base class for all articles/pages""" ------------------------------- Should the class not be named class Webpage?

Ron ter Borg  Dec 22, 2018 
PDF Page 64-65
end 64, begin 65

The classes Website and Webpage (and hence the derived subclasses) have been used inconsistently. I think all the classes should be Webpage, and the subclasses Product and Article should extend Webpage.

Ron ter Borg  Dec 22, 2018 
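A sketch of the hierarchy this erratum proposes; the attribute names here are illustrative, not the book's:

```python
class Webpage:
    """Common base class for all scraped pages."""
    def __init__(self, name, url):
        self.name = name
        self.url = url

class Product(Webpage):
    """A product page: everything a Webpage has, plus a price."""
    def __init__(self, name, url, price):
        super().__init__(name, url)
        self.price = price

class Article(Webpage):
    """An article page: everything a Webpage has, plus body text."""
    def __init__(self, name, url, body):
        super().__init__(name, url)
        self.body = body
```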
PDF Page 73
last paragraph before the code

Using two separate Rule and LinkExtractor classes with a single parsing function... This is not correct. Rule is a function and not a class.

Ron ter Borg  Dec 23, 2018 
Printed Page 76
Code at top of page

Code for storing data into a CSV file, pages 75-76. The line "writer.writerow(csvRow)" is indented when it should not be. As the code is printed in the book, this line writes to the CSV file every time the nested for loop is run. Instead, I believe it should be unindented so that it writes to the CSV file only when csvRow contains the entire row of information.

AV  Apr 25, 2019 
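A sketch of the corrected control flow, using io.StringIO and placeholder row data rather than the book's scraped table:

```python
import csv
import io

# Placeholder for the parsed table: each inner list is one row of cells.
rows = [["a", "b"], ["c", "d"]]

output = io.StringIO()
writer = csv.writer(output)
for row in rows:
    csvRow = []
    for cell in row:
        csvRow.append(cell)
    # Unindented relative to the inner loop: write once per table row,
    # not once per cell.
    writer.writerow(csvRow)
```

With writerow inside the inner loop, partial rows would be written once per cell instead of one complete row per table row.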
PDF Page 93
4th paragraph

timestamp column should be: created column

Ron ter Borg  Dec 27, 2018