Errata for Web Scraping with Python

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious technical mistake | Minor technical mistake | Language or formatting error | Typo | Question | Note | Update

Version | Location | Description | Submitted by | Date submitted
Chapter 2
Table 2-1, the meaning section for $

Simply missing the letter 'h' in the word 'thought'. Text: "This can be thougt of as analogous to the ^ symbol."

Devin  Mar 12, 2017 
Printed, PDF, Mobi Page xi
About This Book, the last sentence in the 2nd paragraph.

> If you are a more advanced reader, feel free to skim these parts!

skim -> skip?

niki  Feb 22, 2021 
PDF Page Chapter 1
Connecting pg 5 (PDF)

Will continue reading and gaining experience. Thank you, Rey

# from Web scraping w/python 2ndED O'Reilly

# book code does not work...
# had to spend time researching urlopen error + certificate has expired (_ssl.c:1124) which was not the issue.
# tried pip certifi and finally found a Stack Overflow post detailing difference
# between requests.get and urllib.request.urlopen
# that corrected error and provided other clues

# from chp 1
# from urllib.request import urlopen
# html = urlopen('*')
# Line above results in urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]
# certificate expired

# getting to this line using requests.get results in
# Response obj has no attrib 'read'
# print(html.read())


# following works...

import requests

url = '*'

html = requests.get(url) # works

# works also but lengthy...
# html = requests.get('*')

# print(html) # rtns Response[200] but no content
print('Status code: ', html.status_code) # rtns 200
print('Content:\n ', html.text) # provides content
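
For readers hitting the same CERTIFICATE_VERIFY_FAILED error but who want to keep using urlopen, one possible workaround (a sketch, not code from the book; it assumes the failure comes from an out-of-date local CA store rather than from the target site, and the URL is assumed to be the book's example page) is to pass an SSL context built from certifi's CA bundle:

import ssl
import certifi
from urllib.request import urlopen

# Build an SSL context from certifi's up-to-date CA bundle
# (requires `pip install certifi`).
context = ssl.create_default_context(cafile=certifi.where())
html = urlopen('https://www.pythonscraping.com/pages/page1.html', context=context)
print(html.read())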

Anonymous  Mar 19, 2023 
PDF Page Chapter 1
Connecting pg 5 (PDF)

Submitted unconfirmed errata on 03/19/2023; however, I forgot to mention the Python version (3.8.6), on Windows 10, using both IDLE and VS Code (version 1.75).
Sorry. Getting on in age (69) - my bad 8-)

Thanks, Rey

Rey Collazo  Mar 20, 2023 
PDF Page Chapter 1
Your first web scraper, pg 8 BeautifulSoup

OK, the problem continues with the expired SSL certificate when attempting to use urlopen from urllib.request.
Running Python 3.8 on Windows 10 (64-bit) and using IDLE in a virtual environment created with python -m venv blahblah.

It steps through with no problem until attempting to open page1.html, which then displays:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1124)

Changing to the requests library gives NO issues. The issue has to be with the lack of a valid SSL certificate for the target web page.

So I will now change to using the requests library with the code from this 2nd edition book.
I will see if the code changed from the 1st edition, which I doubt, but you never know.
I will print out the listed errata so I will have a "heads up."

Thank you, Rey

Rey Collazo  Mar 20, 2023 
ePub Page 4
9


bsObj = BeautifulSoup( html)

Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 423-424). O'Reilly Media. Kindle Edition.

The call to BeautifulSoup on Windows (Python 3 from Anaconda3) can produce an error if the web page has a non-ASCII character. It can be fixed deep down in the Python code, but it would be better to warn the user that, at least in this setting, you can get a character-encoding error.
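
One way to sidestep this kind of character-encoding error (a sketch, not the book's code; it assumes the decoding step is the culprit, and uses the book's example page purely for illustration) is to read the raw bytes and decode them explicitly before handing them to BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Decode the raw bytes explicitly as UTF-8, replacing any undecodable
# byte sequences so that non-ASCII content cannot raise a decode error.
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
raw = html.read().decode('utf-8', errors='replace')
bsObj = BeautifulSoup(raw, 'html.parser')
print(bsObj.h1)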

Clifford Ireland  Jul 02, 2019 
Printed Page 9
Last code example

Need to import HTTPError like this:

from urllib.error import HTTPError

in order to use HTTPError handler.

This is only shown in the code sample 2 pages later, and never explicitly mentioned.
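
For reference, a minimal sketch of the pattern with the import in place (the URL is the book's example page, used here only for illustration):

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)  # e.g. "HTTP Error 404: Not Found"
else:
    print(html.read())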

STAVROS MACRAKIS  May 26, 2020 
PDF Page 10
2nd paragraph below the code

The book says
"If the server is not found at all (if, say, http://www.pythonscraping.com was down, or the URL was mistyped), urlopen returns a None object.", but actually urlopen never returns a None object.
In "The Python Standard Library" documentation, the introduction of urlopen says "Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens)." (https://docs.python.org/3/library/urllib.request.html#module-urllib.request).
So, instead of returning a None object, it will raise a URLError when the server is not found, as the documentation says "Raises URLError on protocol errors.".
In my test, if the URL is mistyped intentionally, it returns "URLError: <urlopen error [Errno 11001] getaddrinfo failed>".
This chapter introduces only HTTPError (a subclass of URLError) and omits URLError, which seems incomplete.
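
A sketch of handling both exceptions described above (the mistyped host name is hypothetical); note that HTTPError must be caught before URLError, since it is a subclass:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # Hypothetical mistyped host name, used only to trigger URLError.
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print('The server returned an HTTP error:', e)
except URLError as e:
    print('The server could not be found!', e)
else:
    print(html.read())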

Anonymous  Jul 17, 2016 
PDF Page 28
under the section Lambda Expressions

quote:
"BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean.

Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to “true” are returned while the rest are discarded.

soup.findAll(lambda tag: len(tag.attrs) == 2)"

---
It says the types of functions that can be passed into findAll() must return a boolean, but the example uses len(), which returns an int, not a boolean. I think you meant for the condition to evaluate to true rather than for the inner function itself to return a boolean, but the phrasing could be clearer.
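
A self-contained sketch of the lambda-filter pattern under discussion (the HTML snippet is made up for illustration): findAll calls the lambda once per tag and keeps the tags for which it evaluates to True.

from bs4 import BeautifulSoup

html = '<div id="a" class="x">one</div><div>two</div><span id="b" class="y">three</span>'
soup = BeautifulSoup(html, 'html.parser')

# Keeps only tags with exactly two attributes: the first div and the span.
print(soup.findAll(lambda tag: len(tag.attrs) == 2))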

Anonymous  Jul 13, 2017 
Printed Page 30
The section with code - the regex

I think that for a regex newbie like me, it would be nice if the regex were consistent. If the forward slashes don't need to be escaped, then why are they?

I am just really confused about the forward slashes.

Thanks for your help.

What I posted in a regex course:
I was reading a book about Python web scraping and then referenced the quick guide to regex, and it seems to me that if I want to find the following pattern:

../img/gifts/img1.jpg, ../img/gifts/img2.jpg etc.

The expression should really be:

'\.\.\/img\/gifts\/img.*\.jpg' right?

Wondering if I am missing something.

The response I got:
Sure, Ray. Looks good. You can further tighten the constraints (instead of the wildcard) by explicitly looking for digits. You don't need to escape forward slashes.

\.\./img/gifts/img\d{1,}\.jpg
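
A quick check of both forms (a sketch; the sample paths come from the erratum above) confirms they match the same strings, since forward slashes have no special meaning in Python regular expressions:

import re

paths = ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']

escaped = re.compile(r'\.\.\/img\/gifts\/img.*\.jpg')      # slashes escaped
unescaped = re.compile(r'\.\./img/gifts/img\d{1,}\.jpg')   # slashes left alone

for p in paths:
    print(p, bool(escaped.match(p)), bool(unescaped.match(p)))
# both patterns match both paths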

Anonymous  May 09, 2018 
PDF Page 32
Code example on the page

When the book instructs you to build a simple scraper to find Kevin Bacon's film history, it does not take into account that Wikipedia has blocked this type of crawler.

The example on page 32 is impossible to follow along with because Wikipedia now requires SSL (HTTPS) access from the crawler.
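
If the failure is only the HTTP-to-HTTPS change, switching the URL scheme may be enough (a sketch, not the book's exact code; the Kevin Bacon article is the page the example crawls):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Request the article over HTTPS rather than HTTP.
html = urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, 'html.parser')
print(bsObj.h1.get_text())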

Dockmann  Jun 01, 2017 
PDF Page 33
2nd bullet point

quote: " The URLs do not contain semicolons"

should be: The URLs do not contain colons

Anonymous  Jul 20, 2017 
Printed, PDF Page 33
2nd paragraph

Second paragraph states:

"The URLs do not contain semicolons"

Line six of following example code:

for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!:).)*$")):

should be:

for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!;).)*$")):
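
For reference, a quick sketch (the example links are my own) showing what the pattern as printed in the book actually excludes:

import re

pattern = re.compile('^(/wiki/)((?!:).)*$')  # the pattern as printed

links = ['/wiki/Kevin_Bacon',
         '/wiki/Category:Articles_with_short_description',
         '/wiki/Wikipedia:Protection_policy']

for link in links:
    print(link, bool(pattern.match(link)))
# only /wiki/Kevin_Bacon matches; links containing a colon are excluded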

JR  Oct 04, 2017 
PDF Page 34
2nd paragraph

When I ran the code on page 34, 2nd paragraph, I got the following error message:

AttributeError: 'NoneType' object has no attribute 'find_all'
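
This AttributeError usually means that find() returned None because the target element was not located. A defensive sketch (not the book's code; the URL is the chapter's example article) that makes the failure explicit:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

# find() returns None when the element is missing, which is what makes the
# chained .find_all() call raise AttributeError.
body = bs.find('div', {'id': 'bodyContent'})
if body is None:
    print('No div with id="bodyContent" found; the page may have changed')
else:
    for link in body.find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
        print(link.attrs.get('href'))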


Barrick Chang  Aug 31, 2020 
Printed Page 35
First Paragraph through code midway down

The issue about the colon vs. semicolon in other errata is moot: colons do appear in valid links on the Kevin Bacon Wikipedia page. Examples include:

/wiki/Tremors_5:_Bloodline
/wiki/X-Men:_First_Class

Therefore the regular expression seeking to exclude all non-content links and include only content links excludes at least two content links.

Anonymous  May 04, 2018 
PDF Page 35
2nd paragraph

When I ran the code block in the 2nd paragraph, I got the following error message:

AttributeError: 'NoneType' object has no attribute 'find_all'

Anonymous  Aug 31, 2020 
PDF Page 64
6th paragraph

If the pages are all similar (they all have basically the same types of content), you may want to add a pageType attribute to your existing web-page object:

class Website:
"""Common base class for all articles/pages"""
-------------------------------

Should the class not be named class Webpage?

Ron ter Borg  Dec 22, 2018 
PDF Page 64-65
end 64, begin 65

The classes Website and Webpage (and hence the derived subclasses) have been used inconsistently.
I think the base class should be Webpage, and the subclasses Product and Article should extend Webpage.
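
A sketch of the structure suggested here (the attribute names are my own illustration, not the book's):

class Webpage:
    """Common base class for all pages"""
    def __init__(self, name, url):
        self.name = name
        self.url = url

class Product(Webpage):
    """A page that sells a product"""
    def __init__(self, name, url, price):
        super().__init__(name, url)
        self.price = price

class Article(Webpage):
    """A page containing an article"""
    def __init__(self, name, url, body):
        super().__init__(name, url)
        self.body = body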

Ron ter Borg  Dec 22, 2018 
PDF Page 73
last paragraph before code

Using two separate Rule and LinkExtractor classes with a single parsing function...

This is not correct. Rule is a function and not a class.

Ron ter Borg  Dec 23, 2018 
Printed Page 76
Code at top of page

Code for storing data into a CSV file, pages 75-76. The line "writer.writerow(csvRow)" is indented when it should not be. As the code is printed in the book, this line writes to the CSV file every time the nested for loop is run. Instead, I believe it should be unindented so it only writes to the CSV file once "csvRow" has the entire row of information.
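
A sketch of the fix described above (the table and selectors are reconstructed, so treat the details as assumptions rather than the book's exact code): writer.writerow(csvRow) sits outside the inner loop, so each table row is written once, after all of its cells have been collected.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find_all('table', {'class': 'wikitable'})[0]
rows = table.find_all('tr')

with open('editors.csv', 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        csvRow = []
        for cell in row.find_all(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)  # outside the inner loop: one write per row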

AV  Apr 25, 2019 
PDF Page 93
4th paragraph

timestamp column

should be:

created column

Ron ter Borg  Dec 27, 2018