Errata for Web Scraping with Python

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious technical mistake | Minor technical mistake | Language or formatting error | Typo | Question | Note | Update

Version | Location | Description | Submitted by | Date submitted
Chapter 2
Table 2-1, the meaning section for $

Simply missing the letter 'h' in the word 'thought'. Text: "This can be thougt of as analogous to the ^ symbol."

Devin  Mar 12, 2017 
Printed, PDF, Mobi Page xi
About This Book, the last sentence in the 2nd paragraph.

> If you are a more advanced reader, feel free to skim these parts!

skim -> skip?

niki  Feb 22, 2021 
PDF Page Chapter 1
Connecting pg 5 (PDF)

Will continue reading and gaining experience. Thank you, Rey

# from Web scraping w/python 2ndED O'Reilly

# book code does not work...
# had to spend time researching urlopen error + certificate has expired (_ssl.c:1124) which was not the issue.
# tried pip certifi and finally found a Stack Overflow post detailing difference
# between requests.get and urllib.request.urlopen
# that corrected error and provided other clues

# from chp 1
# from urllib.request import urlopen
# html = urlopen('*')
# Line above results in urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]
# certificate expired

# getting to this line using requests.get results in
# Response obj has no attrib 'read'
# print(html.read())


# following works...

import requests

url = '*'

html = requests.get(url) # works

# works also but lengthy...
# html = requests.get('*')

# print(html) # rtns Response[200] but no content
print('Status code: ', html.status_code) # rtns 200
print('Content:\n ', html.text) # provides content
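
For readers hitting the same CERTIFICATE_VERIFY_FAILED error but who want to keep using urlopen, one possible workaround (a sketch, not code from the book; it assumes the failure comes from an out-of-date local CA store rather than from the target site, and the URL is assumed to be the book's example page) is to pass an SSL context built from certifi's CA bundle:

import ssl
import certifi
from urllib.request import urlopen

# Build an SSL context from certifi's up-to-date CA bundle
# (requires `pip install certifi`).
context = ssl.create_default_context(cafile=certifi.where())
html = urlopen('https://www.pythonscraping.com/pages/page1.html', context=context)
print(html.read())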

Anonymous  Mar 19, 2023 
PDF Page Chapter 1
Connecting pg 5 (PDF)

Submitted unconfirmed errata on 03/19/2023; however, I forgot to mention the Python version (3.8.6), on Windows 10, using both IDLE and VS Code (version 1.75).
Sorry. Getting on in age (69) - my bad 8-)

Thanks, Rey

Rey Collazo  Mar 20, 2023 
PDF Page Chapter 1
Your first web scraper, pg 8 BeautifulSoup

OK, the problem continues with the expired SSL certificate when attempting to use urlopen from urllib.request.
Running Python 3.8 on Windows 10 (64-bit) and using IDLE in a virtual environment created with python -m venv blahblah.

It steps through with no problem until attempting to open page1.html, which then displays:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1124)

Changing to the requests library gives NO issues. The issue has to be with the lack of a valid SSL certificate for the target web page.

So I will now change to using the requests library with the code from this 2nd edition book.
I will see if the code changed from the 1st edition, which I doubt, but you never know.
I will print out the listed errata so I will have a "heads up."

Thank you, Rey

Rey Collazo  Mar 20, 2023 
ePub Page 4
9


bsObj = BeautifulSoup( html)

Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 423-424). O'Reilly Media. Kindle Edition.

The call to BeautifulSoup on Windows (Python 3 from Anaconda3) can produce an error if the web page has a non-ASCII character. It can be fixed deep down in the Python code, but it would be better to warn the user that, at least in this setting, you can get a character-encoding error.
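
One way to sidestep this kind of character-encoding error (a sketch, not the book's code; it assumes the decoding step is the culprit, and uses the book's example page purely for illustration) is to read the raw bytes and decode them explicitly before handing them to BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Decode the raw bytes explicitly as UTF-8, replacing any undecodable
# byte sequences so that non-ASCII content cannot raise a decode error.
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
raw = html.read().decode('utf-8', errors='replace')
bsObj = BeautifulSoup(raw, 'html.parser')
print(bsObj.h1)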

Clifford Ireland  Jul 02, 2019 
Printed Page 9
Last code example

Need to import HTTPError like this:

from urllib.error import HTTPError

in order to use HTTPError handler.

This is only shown in the code sample 2 pages later, and never explicitly mentioned.
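
For reference, a minimal sketch of the pattern with the import in place (the URL is the book's example page, used here only for illustration):

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)  # e.g. "HTTP Error 404: Not Found"
else:
    print(html.read())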

STAVROS MACRAKIS  May 26, 2020 
PDF Page 10
2nd paragraph below the code

The book says
"If the server is not found at all (if, say, http://www.pythonscraping.com was down, or the URL was mistyped), urlopen returns a None object.", but actually urlopen never returns a None object.
In "The Python Standard Library" documentation, the introduction of urlopen says "Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens)." (https://docs.python.org/3/library/urllib.request.html#module-urllib.request).
So, instead of returning a None object, it will raise a URLError when the server is not found, as the documentation says "Raises URLError on protocol errors.".
In my test, if the URL is mistyped intentionally, it returns "URLError: <urlopen error [Errno 11001] getaddrinfo failed>".
This chapter introduces only HTTPError (a subclass of URLError) and omits URLError, which seems incomplete.
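
A sketch of handling both exceptions described above (the mistyped host name is hypothetical); note that HTTPError must be caught before URLError, since it is a subclass:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # Hypothetical mistyped host name, used only to trigger URLError.
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print('The server returned an HTTP error:', e)
except URLError as e:
    print('The server could not be found!', e)
else:
    print(html.read())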

Anonymous  Jul 17, 2016 
PDF Page 28
under the section Lambda Expressions

quote:
"BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean.

Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to “true” are returned while the rest are discarded.

soup.findAll(lambda tag: len(tag.attrs) == 2)"

---
It says the types of functions that can be passed into findAll() must return a boolean, but the example uses len(), which returns an int, not a boolean. I think you meant for the condition to evaluate to true rather than for the inner function itself to return a boolean, but the phrasing could be clearer.
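
A self-contained sketch of the lambda-filter pattern under discussion (the HTML snippet is made up for illustration): findAll calls the lambda once per tag and keeps the tags for which it evaluates to True.

from bs4 import BeautifulSoup

html = '<div id="a" class="x">one</div><div>two</div><span id="b" class="y">three</span>'
soup = BeautifulSoup(html, 'html.parser')

# Keeps only tags with exactly two attributes: the first div and the span.
print(soup.findAll(lambda tag: len(tag.attrs) == 2))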

Anonymous  Jul 13, 2017 
Printed Page 30
The section with code - the regex

I think that for a regex newbie like me, it would be nice if the regex were consistent. If the forward slashes don't need to be escaped, then why are they?

I am just really confused about the forward slashes.

Thanks for your help.

What I posted in a regex course:
I was reading a book about Python web scraping and then referenced the quick guide to regex, and it seems to me that if I want to find the following pattern:

../img/gifts/img1.jpg, ../img/gifts/img2.jpg etc.

The expression should really be:

'\.\.\/img\/gifts\/img.*\.jpg' right?

Wondering if I am missing something.

The response I got:
Sure, Ray. Looks good. You can further tighten the constraints (instead of the wildcard) by explicitly looking for digits. You don't need to escape forward slashes.

\.\./img/gifts/img\d{1,}\.jpg
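
A quick check of both forms (a sketch; the sample paths come from the erratum above) confirms they match the same strings, since forward slashes have no special meaning in Python regular expressions:

import re

paths = ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']

escaped = re.compile(r'\.\.\/img\/gifts\/img.*\.jpg')      # slashes escaped
unescaped = re.compile(r'\.\./img/gifts/img\d{1,}\.jpg')   # slashes left alone

for p in paths:
    print(p, bool(escaped.match(p)), bool(unescaped.match(p)))
# both patterns match both paths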

Anonymous  May 09, 2018 
PDF Page 32
Code example on the page

When the book instructs you to build a simple scraper to find Kevin Bacon's film history, it does not take into account that Wikipedia has blocked this type of crawler.

The example on page 32 is impossible to follow along with because Wikipedia now requires SSL (HTTPS) access from the crawler.
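
If the failure is only the HTTP-to-HTTPS change, switching the URL scheme may be enough (a sketch, not the book's exact code; the Kevin Bacon article is the page the example crawls):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Request the article over HTTPS rather than HTTP.
html = urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, 'html.parser')
print(bsObj.h1.get_text())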

Dockmann  Jun 01, 2017 
PDF Page 33
2nd bullet point

quote: " The URLs do not contain semicolons"

should be: The URLs do not contain colons

Anonymous  Jul 20, 2017 
Printed, PDF Page 33
2nd paragraph

Second paragraph states:

"The URLs do not contain semicolons"

Line six of following example code:

for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!:).)*$")):

should be:

for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!;).)*$")):
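
For reference, a quick sketch (the example links are my own) showing what the pattern as printed in the book actually excludes:

import re

pattern = re.compile('^(/wiki/)((?!:).)*$')  # the pattern as printed

links = ['/wiki/Kevin_Bacon',
         '/wiki/Category:Articles_with_short_description',
         '/wiki/Wikipedia:Protection_policy']

for link in links:
    print(link, bool(pattern.match(link)))
# only /wiki/Kevin_Bacon matches; links containing a colon are excluded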

JR  Oct 04, 2017 
PDF Page 34
2nd paragraph

When I ran the code on page 34, 2nd paragraph, I got the following error message:

AttributeError: 'NoneType' object has no attribute 'find_all'
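
This AttributeError usually means that find() returned None because the target element was not located. A defensive sketch (not the book's code; the URL is the chapter's example article) that makes the failure explicit:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

# find() returns None when the element is missing, which is what makes the
# chained .find_all() call raise AttributeError.
body = bs.find('div', {'id': 'bodyContent'})
if body is None:
    print('No div with id="bodyContent" found; the page may have changed')
else:
    for link in body.find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
        print(link.attrs.get('href'))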


Barrick Chang  Aug 31, 2020 
Printed Page 35
First Paragraph through code midway down

The issue about the colon vs. semicolon in other errata is moot: colons do appear in valid links on the Kevin Bacon Wikipedia page. Examples include:

/wiki/Tremors_5:_Bloodline
/wiki/X-Men:_First_Class

Therefore the regular expression seeking to exclude all non-content links and include only content links excludes at least two content links.

Anonymous  May 04, 2018 
PDF Page 35
2nd paragraph

When I ran the code block in the 2nd paragraph, I got the following error message:

AttributeError: 'NoneType' object has no attribute 'find_all'

Anonymous  Aug 31, 2020 
PDF Page 64
6th paragraph

If the pages are all similar (they all have basically the same types of content), you may want to add a pageType attribute to your existing web-page object:

class Website:
"""Common base class for all articles/pages"""
-------------------------------

Should the class not be named class Webpage?

Ron ter Borg  Dec 22, 2018 
PDF Page 64-65
end 64, begin 65

The classes Website and Webpage (and hence the derived subclasses) have been used inconsistently.
I think the base class should be Webpage, and the subclasses Product and Article should extend Webpage.
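
A sketch of the structure suggested here (the attribute names are my own illustration, not the book's):

class Webpage:
    """Common base class for all pages"""
    def __init__(self, name, url):
        self.name = name
        self.url = url

class Product(Webpage):
    """A page that sells a product"""
    def __init__(self, name, url, price):
        super().__init__(name, url)
        self.price = price

class Article(Webpage):
    """A page containing an article"""
    def __init__(self, name, url, body):
        super().__init__(name, url)
        self.body = body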

Ron ter Borg  Dec 22, 2018 
PDF Page 73
last paragraph before code

Using two separate Rule and LinkExtractor classes with a single parsing function...

This is not correct. Rule is a function and not a class.

Ron ter Borg  Dec 23, 2018 
Printed Page 76
Code at top of page

Code for storing data into a CSV file, pages 75-76. The line "writer.writerow(csvRow)" is indented when it should not be. As the code is printed in the book, this line writes to the CSV file every time the nested for loop is run. Instead, I believe it should be unindented so it only writes to the CSV file once "csvRow" has the entire row of information.
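
A sketch of the fix described above (the table and selectors are reconstructed, so treat the details as assumptions rather than the book's exact code): writer.writerow(csvRow) sits outside the inner loop, so each table row is written once, after all of its cells have been collected.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find_all('table', {'class': 'wikitable'})[0]
rows = table.find_all('tr')

with open('editors.csv', 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        csvRow = []
        for cell in row.find_all(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)  # outside the inner loop: one write per row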

AV  Apr 25, 2019 
PDF Page 93
4th paragraph

timestamp column

should be:

created column

Ron ter Borg  Dec 27, 2018