Errata

Errata for Web Scraping with Python

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version	Location	Description	Submitted By	Date submitted	Date corrected
Printed, PDF, ePub	last code block	Running the table to csv code (to turn a wikipedia table into a csv file) only captures the headers. The cells aren't filled with anything. Note from the Author or Editor: Formatting error causes inner "for" loop to be outdented, causing the logic in the code to break. The code on Github is correct: https://github.com/REMitchell/python-scraping/blob/master/chapter5/3-scrapeCsv.py	Anonymous	Jul 26, 2015	Oct 30, 2015
ePub		Traceback (most recent call last): File "/home/dave/python/scrape_add_to_db.py", line 28, in <module> links = getLinks("/wiki/Kevin_Bacon") File "/home/dave/python/scrape_add_to_db.py", line 22, in getLinks title = bsObj.find("h1").find("span").get_text() AttributeError: 'NoneType' object has no attribute 'get_text' I'm pretty sure that the error "None" means some problem downloading the url, but I know that I got pymysql working and changed my character sets. I thought that kindle might have mangled your nice code again so I went to github and copied and pasted the code, still same error. This is chapter 5 about 34% into the book (no page number on Kindle). Note from the Author or Editor: Unfortunately, Wikipedia has removed span tags from its titles, breaking some of the code in the book. This can be fixed by removing "find("span")" from the code, and just writing: title = bsObj.find("h1").get_text() This will be fixed in ebook editions and updated for future print editions.	Anonymous	Jul 27, 2015	Oct 30, 2015
ePub, Mobi, Other Digital Version	Chapter 8 Reading and Writing Natural Languages; Kindle Locations 3344-3345;	The 's' is missing from the name bigramsDist in the line of code: bigramDist[("Sir", "Robin")] Note from the Author or Editor: Good catch! Have fixed for upcoming prints/ebook releases.	golfpsy101	Aug 11, 2015	Oct 30, 2015
Other Digital Version	Chapter 8 Reading and Writing Natural Languages; Kindle Locations 3401-3406	The text coloring is not consistent for the string in the line of code: text = word_tokenize("Strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.") Note from the Author or Editor: Will be fixed in the ebook and upcoming printings of the book.	golfpsy101	Aug 11, 2015	Oct 30, 2015
ePub	Page 2 Chapter 2	In chapter 2, "Advanced HTML Parsing", I've found the following two errors: (1) in the section titled "A Caveat to the keyword Argument", there is a sentence that begins with 'Alternatively, you can enclose class in quotes'. The sample code that follows 'bsObj.findall("", {"class":"green"}' is missing the right parenthesis. (2) Once again in chapter 2, "Advanced HTML Parsing", in the section titled "Other BeautifulSoup Objects" there is a sentence that is indented under "Tag objects" that ends in a colon (':'). The colon, traditionally and grammatically, signals that additional information follows, but none does. Is this a grammar typo or is the text that follows the colon actually missing? Please accept my apology for not providing page numbers but my ePub version of your book does not contain page numbering on my Kindle Fire. I now have a valid reason why I should not buy eBooks. From here on, I'll stick to printed technical books: they have always served me well. Not to lay the blame at your feet, but I'm going to buy your print version. I'm working on a project and I don't need the distractions. Note from the Author or Editor: On page 17, the line should read: bsObj.findAll("", {"class":"green"}) On page 18, the line: bsObj.div.h1 Should be moved from its original position and placed under the description of "Tag objects" where it says "Retrieved in lists or individually by calling find and findAll on a BeautifulSoup object, or drilling down, as in:" What follows this sentence should be the example "bsObj.div.h1"	Anonymous	Jul 02, 2015	Jul 22, 2015
Printed, PDF, ePub	Page 16 3rd code example showing how to return both red and green span tags	.findAll("span", {"class": "green", "class": "red"}) An attempt to create a Python dict with repeated keys preserves only the last one entered in the dict. The correct form would be: .findAll("span", {"class": {"green", "red"}}) Note that we are now passing a collection (a set) as the value for the "class" key in the attributes dict. Note from the Author or Editor: The line on page 16, in Chapter 2, should read: .findAll("span", {"class":{"green", "red"}})	Anonymous	Jul 04, 2015	Jul 22, 2015
Printed	Page 16 line 11 from bottom, 2nd paragraph from bottom, 3rd sentence	"If it is false," should be read as "If it is False," Note from the Author or Editor: This will be fixed in upcoming prints and editions	Toshi Kurokawa	Dec 17, 2015	Oct 30, 2015
Printed	Page 16 last line of footnote	the section BeautifulSoup and regular expressions. should be read as the section "Regular Expressions and BeautifulSoup."	Toshi Kurokawa	Dec 17, 2015	Oct 30, 2015
PDF	Page 16 line 12 from bottom	“If recursion is set to True” should be read as “If recursive is set to True” Note from the Author or Editor: Fixed in upcoming prints	Toshi Kurokawa	Dec 29, 2015	Oct 30, 2015
PDF, ePub	Page 18 8th paragraph	The paragraph states: "Retrieved in lists or individually by calling find and findAll on a BeautifulSoup object, or drilling down, as in:" It ends with a colon, but it is followed by a new paragraph. Suggestion: It looks like the 4th paragraph (a line with only "bsObj.div.h1") should be moved there instead, and not simply removed, as suggested in the Note from the Author or Editor.	Anonymous	Jul 17, 2015	Jul 22, 2015
Printed	Page 20 last line of 3rd paragraph	The 'body' of "body tag" should be Bold font.	Toshi Kurokawa	Dec 17, 2015	Mar 18, 2016
PDF	Page 22 line 12 from bottom, in the tree	"- s<td> (2)" should be read as "- <td> (2)"	Toshi Kurokawa	Dec 17, 2015	Oct 30, 2015
PDF	Page 23 line 17 and 22 from top	Rule number 4 at line 17 says: "4. Optionally, write the letter "d" at the end." It does not mention a blank at the end. However, the regex at line 22 reads aabbbbb(cc)(d | ), where a blank comes at the end. To be consistent with the rule, it should read: aabbbbb(cc)(d|). Note from the Author or Editor: Changed text to different, more useful, example	Toshi Kurokawa	Dec 17, 2015	Mar 18, 2016
PDF, ePub	Page 27 9th paragraph	The text reads: " from urllib.request import urlopenfrom bs4 import BeautifulSoupimport re " It should be: " from urllib.request import urlopen from bs4 import BeautifulSoup import re "	lbrancolini	Jul 17, 2015	Jul 22, 2015
PDF	Page 41-42 code snippet, followExternalOnly 3rd Printing	The code has serious bugs in handling internal links. Here is a debugged version: from urllib.request import urlopen from urllib.error import HTTPError from urllib.parse import urlparse from bs4 import BeautifulSoup import re import random #Retrieves a list of all Internal links found on a page def getInternalLinks(bsObj, includeUrl): internalLinks = [] #Finds all links that begin with a "/" for link in bsObj.findAll("a", href=re.compile("^(\/|.*(http:\/\/"+includeUrl+")).*")): if link.attrs['href'] is not None and len(link.attrs['href']) != 0: if link.attrs['href'] not in internalLinks: internalLinks.append(link.attrs['href']) return internalLinks #Retrieves a list of all external links found on a page def getExternalLinks(bsObj, url): excludeUrl = getDomain(url) externalLinks = [] #Finds all links that start with "http" or "www" that do #not contain the current URL for link in bsObj.findAll("a", href=re.compile("^(http)((?!"+excludeUrl+").)*$")): if link.attrs['href'] is not None: if link.attrs['href'] not in externalLinks: externalLinks.append(link.attrs['href']) return externalLinks def getDomain(address): return urlparse(address).netloc def followExternalOnly(bsObj, url): externalLinks = getExternalLinks(bsObj, url) if len(externalLinks) == 0: print("Only internal links here. Try again.") internalLinks = getInternalLinks(bsObj, getDomain(url)) if len(internalLinks) == 0: return if len(internalLinks) == 1: randInternalLink = internalLinks[0] else: randInternalLink = internalLinks[random.randint(0, len(internalLinks)-1)] if randInternalLink[0:4] != 'http': randInternalLink = 'http://'+getDomain(url)+randInternalLink if randInternalLink == url and len(internalLinks) == 1: return bsObjnext = BeautifulSoup(urlopen(randInternalLink), "html.parser") #Try again followExternalOnly(bsObjnext, randInternalLink) else: randomExternal = externalLinks[random.randint(0, len(externalLinks)-1)] try: nextBsObj = BeautifulSoup(urlopen(randomExternal), "html.parser") print(randomExternal) #Next page! followExternalOnly(nextBsObj, randomExternal) except HTTPError: #Try again print("Encountered error at "+randomExternal+"! Trying again") followExternalOnly(bsObj, url) url = "http://oreilly.com" bsObj = BeautifulSoup(urlopen(url), "html.parser") #Recursively follow external links followExternalOnly(bsObj, url) Note from the Author or Editor: This code has been updated on Github and will be fixed in upcoming prints and editions of the book	Toshi Kurokawa	Jan 03, 2016	Mar 18, 2016
PDF	Page 42 Inside the 'getRandomExternalLink' function.	Inside the getRandomExternalLink function in the if/else statement, the 'if' statement is set to return 'getNextExternalLink' if the length of externalLinks is equal to zero. The 'getNextExternalLink' was never defined. Note from the Author or Editor: Updated code can be found in the github repository at: https://github.com/REMitchell/python-scraping/blob/master/chapter3/4-getExternalLinks.py	Anonymous	Sep 14, 2015	Oct 30, 2015
PDF	Page 42 Line 5 from top, the comment	#Finds all links that start with "http" or "www" that do Should be read as #Finds all links that start with "http" that do To reflect the revised code line 8 from top Note from the Author or Editor: Changed the code to reflect this comment	Toshi Kurokawa	Jan 01, 2016	Mar 18, 2016
PDF	Page 42 the bottom example lines	Random external link is: http://igniteshow.com/ Random external link is: http://feeds.feedburner.com/oreilly/news Random external link is: http://hire.jobvite.com/CompanyJobs/Careers.aspx?c=q319 Random external link is: http://makerfaire.com/ Should be read as http://igniteshow.com/ http://feeds.feedburner.com/oreilly/news http://hire.jobvite.com/CompanyJobs/Careers.aspx?c=q319 http://makerfaire.com/ Reflecting revised code print function, line 10 from the bottom of code snippet. Note from the Author or Editor: Updated code to reflect printout	Toshi Kurokawa	Jan 01, 2016	Mar 18, 2016
PDF	Page 45 The bottom schema	The directory structure is different from the shown as: • scrapy.cfg — wikiSpider — __init.py__ — items.py This should be the following: —scrapy.cfg — wikiSpider 　— __init.py__ 　— items.py	Toshi Kurokawa	Dec 17, 2015	Mar 18, 2016
PDF	Page 46 1st sentence	The 1st sentence: In order to create a crawler, we will add a new file to wikiSpider/wikiSpider/spiders/ articleSpider.py called items.py. Should be read as: In order to create a crawler, we will add a new file, articleSpider.py, to wikiSpider/wikiSpider/spiders/.	Toshi Kurokawa	Dec 17, 2015	Mar 18, 2016
PDF	Page 46 Bottom paragraph, 3rd line and 2nd line	The two words “WikiSpider”should be read as “wikiSpider”.	Toshi Kurokawa	Dec 17, 2015	Oct 30, 2015
PDF	Page 48 Side bar ‘Logging with Scrapy’Last sentence	The last sentence tells: This will create a new logfile, if one does not exist, in your current directory and output all logs and print statements to it. this should be read as This will create a new logfile, if one does not exist, in your current directory and output all logs to it.	Toshi Kurokawa	Dec 17, 2015	Mar 18, 2016
PDF	Page 58 1st code after side-bar of Twitter Credential Permissions	from twitter import Twitter should be read as from twitter import Twitter, OAuth	Toshi Kurokawa	Dec 18, 2015	Oct 30, 2015
PDF	Page 62 2nd paragraph, 1st sentence	Google’s Geocode API, should be read as Google’s Geocoding API	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 67 code snippet	Insert: import json This is missing in https://github.com/REMitchell/python-scraping/blob/master/chapter4/6-wikiHistories.py Note from the Author or Editor: The import statement has been added for future versions of the book	Toshi Kurokawa	Dec 29, 2015	Mar 18, 2016
Printed	Page 73 getAbsoluteURL()	#second elif: url = source[4:] url = "http://"+source #should be: url = "http://"+source[4:]	Lem Dulfo	Sep 13, 2015	Oct 30, 2015
PDF	Page 73 last for loop of code snippet	The last part of code snippet: bsObj = BeautifulSoup(html) downloadList = bsObj.findAll(src=True) for download in downloadList: fileUrl = getAbsoluteURL(baseUrl, download["src"]) if fileUrl is not None: print(fileUrl) urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory)) should be for download in downloadList: fileUrl = getAbsoluteURL(baseUrl, download["src"]) if fileUrl is not None: print(fileUrl) urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory)) Note from the Author or Editor: This was caused by an indentation error. It has been fixed in Github and will be fixed for future editions and prints of the book.	Toshi Kurokawa	Jan 05, 2016	Mar 18, 2016
Printed, PDF, ePub	Page 84 block of code	The line of code: import re is missing: a Regular Expression is used at the end of the getLinks function: return bsObj.find("div",{"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))	Anonymous	Jul 21, 2015	Oct 30, 2015
PDF	Page 88 import statements in code snippet	The line "from urllib.request import urlopen" appears twice, which is redundant.	Toshi Kurokawa	Jan 01, 2016	Mar 18, 2016
PDF	Page 94 last sentence before the section, Text	In this chapter, I’ll cover several commonly encountered types of files: text, PDFs, PNGs, and GIFs. However the PNG and GIF are not covered. It should be read as: In this chapter, I’ll cover several commonly encountered types of files: text, PDFs, and .docx.	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 98 4th paragraph	Whereas the European Computer Manufacturers Association’s website has this tag However, it is now officially ECMA International, so it should be read as: Whereas the ECMA International’s website has this tag	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 104 output of the <w:t> tag, last output example	This is a Word document, full of content that you want very much. Unfortunately, it’s difficult to access because I’m putting it on my website as a . docx file, rather than just publishing it as HTML should be read as This is a Word document, full of content that you want very much. Unfortunately, it’s difficult to access because I’m putting it on my website as a . docx file, rather than just publishing it as HTML	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
Printed	Page 113 8th paragraph, final code on page	In the Data Normalization section of chapter 7, there is a reference to recording the frequency of the 2-grams, then at the bottom of the page we are given a code snippet that introduces OrderedDict and uses the sorted function. In the sorted function the code contains ngrams.items() however the ngrams method returns a list and lists do not have an items() method. So the program generates an error. In the next chapter, it looks like the code (at least on GitHub) has the ngrams function return a dictionary instead which allows the code in chapter 7 to work. Note from the Author or Editor: I mentioned the code that would accomplish this in passing, but did not actually include it. It will be included in future printings of the book, and in the ebook.	Micheal Beatty	Aug 16, 2015	Oct 30, 2015
PDF	Page 113 line 4 output	("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38),.... should be read as OrderedDict([("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38),.... Note from the Author or Editor: Updated to: "OrderedDict([('of the', 38), ('Software Foundation', 37), ..."	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 113 line 12, output of ngrams	The current output is inconsistent with the code snippet. ("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38), ("['of', 'the']", 35), ("['Foundation', 'Retrieved']", 34), ("['of', 'Python']", 28), ("['in', 'the']", 21), ("['van', 'Rossum']", 18) First, the value of ngrams is an OrderedDict. Second, getNgrams generates a string for each 2-gram instead of a list of two strings. The actual output looks like the following: OrderedDict([('Software Foundation', 37), ('of the', 37), ('Python Software', 37), ('Foundation Retrieved', 32), ('of Python', 32), ('in the', 22), ('such as', 20), ('van Rossum', 19)... Note from the Author or Editor: Updated the output of the script to reflect the use of the OrderedDict	Toshi Kurokawa	Jan 02, 2016	Mar 18, 2016
PDF	Page 115 line 6 from the bottom	me data that contains four or more comma-seperated programming languages Should be read as me data that contains three or more comma-seperated programming languages	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 118 the last sentence	The last sentence refers: guide to the language can be found on OpenRefine’s GitHub page This pointer refers to https://github.com/sixohsix/twitter/tree/master, which is not the precise page for the OpenRefine guide documents. This should be https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 122 output bullets at the bottom	• The Constitution of the United States is the instrument containing this grant of power to the several departments composing the government. Should be read as • The Constitution of the United States is the instrument containing this grant of power to the several departments composing the Government. The general government has seized upon none of the reserved rights of the states. Should be read as The General Government has seized upon none of the reserved rights of the States.	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 123 bulleted output at the top	The presses in the necessary employment of the government should never be used to clear the guilty or to varnish crime. Should be read as The presses in the necessary employment of the Government should never be used to “clear the guilty or to varnish crime.”	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 123 The 2nd sentence, reference of That can be my next tweet! app	The link embedded in PDF for "That can be my next tweet!" is a wrong one, that should be http://yes.thatcan.be/my/next/tweet/ Note from the Author or Editor: The page has changed since the book was written. Updated for future editions	Toshi Kurokawa	Jan 01, 2016	Mar 18, 2016
PDF	Page 139 line 10 from bottom, the 1st bullet	name is email_address) should be read as name is email_addr)	Toshi Kurokawa	Dec 18, 2015	Oct 30, 2015
PDF	Page 139 line 4-5 from bottom in the code snippet	The part of code snippet r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/ quicksignup.cgi", data=params) causes EOL error because of string break. It should be like the following: r = requests.post( "http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi", data=params) Note from the Author or Editor: Because of the limitations of printing, there are many instances throughout the book where code needs to be cut off and continued on the next line. Please either correct these as you copy them from the book, or refer to the code repository on Github. In this case, I will use the suggested version, because it corrects an issue with the syntax highlighting caused with this particular line break.	Toshi Kurokawa	Jan 06, 2016	Mar 18, 2016
Printed	Page 141 Code sample on bottom of page	The code says `name="image"`, but following page suggests (and code on actual site is) `name="uploadFile"`.	Ian Gow	Jan 02, 2016	Mar 18, 2016
PDF	Page 142 4th paragraph from the top	Once a site authenticates your login credentials a it stores in your browser a cookie, Should be read as Once a site authenticates your login credentials, it stores in your browser a cookie, Note from the Author or Editor: Changed to "Once a site authenticates your login credentials it stores them in your browser’s cookie"	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 149 footnote	http://blog.jquery.com/2014/01/13/the-stateof-jquery-2014/ should be read as http://blog.jquery.com/2014/01/13/the-state-of-jquery-2014/	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 149 1st sentence of 2nd paragraph	If you find jQuery is found on a site, you must be careful when scraping it. jQuery is Should be read as If you find jQuery on a site, you must be careful when scraping it. jQuery is	Toshi Kurokawa	Dec 29, 2015	Mar 18, 2016
PDF	Page 154 code at the bottom and the line above	page has been fully loaded: from selenium import webdriver. from selenium.webdriver.common.by import By should be laid out as page has been fully loaded: from selenium import webdriver. from selenium.webdriver.common.by import By	Toshi Kurokawa	Dec 29, 2015	Mar 18, 2016
PDF	Page 162 line 6 from the top	The link for installing Pillow http://pillow.readthedocs.org/installation.html does not work, instead use http://pillow.readthedocs.org/en/3.0.x/ Note from the Author or Editor: The link has changed since publication, and is updated in future versions.	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 169 3rd Paragraph 1st sentence after :	Computer Automated Public Turing test to tell Computers and Humans Apart should be read as Completely Automated Public Turing test to tell Computers and Humans Apart	Toshi Kurokawa	Dec 18, 2015	Oct 30, 2015
ePub	Page 172 Figure 8.1	The diagram 8.1 about a Markov weather model has one incorrect percentage value and one incorrect arrow direction: 1. The value for Sunny being sunny the next day should be 70% rather than 20%. 2. The arrow for the 15% chance of Rainy being followed by Cloudy should be reversed so that this shows a 15% chance of Cloud being followed by Rain. Note from the Author or Editor: The description is correct. The corrected Markov diagram is: http://pythonscraping.com/img/markov_8.1.png	Dane Wright	Jul 17, 2015	Jul 22, 2015
PDF	Page 172 2nd paragraph from the bottom and the code snippet	The paragraph and the first code snippet refer to a main method as if it exists in the code referenced in the preceding paragraph, https://github.com/REMitchell/tesseract-trainer/blob/master/trainer.py. However, there is no main method in that code; it uses __init__ instead. So main should be read as __init__.	Toshi Kurokawa	Dec 29, 2015	Mar 18, 2016
PDF	Page 186 line 3 from the bottom	Use a tool such as Chrome’s Network inspector to Should be read as Use a tool such as Chrome’s Network panel to	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 221 last paragraph in side column	In the second scenario, the load your Internet connection and home machine can Should be read as In the third scenario, the load your Internet connection and home machine can	Toshi Kurokawa	Dec 18, 2015	Mar 18, 2016
PDF	Page 230 1st line	DMCS Safe Harbor should be read as DMCA Safe Harbor Note from the Author or Editor: Fixed in upcoming prints	Toshi Kurokawa	Dec 28, 2015	Mar 18, 2016
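
The Page 16 erratum about repeated dictionary keys can be checked without BeautifulSoup. The following is a minimal sketch in plain Python; the variable names broken_attrs and fixed_attrs are illustrative and do not appear in the book.

```python
# A dict literal with a repeated key silently keeps only the last value,
# so the "green" entry is lost before findAll would ever see it.
broken_attrs = {"class": "green", "class": "red"}
print(broken_attrs)

# The corrected form uses a single "class" key whose value is a set
# holding both class names.
fixed_attrs = {"class": {"green", "red"}}
print(sorted(fixed_attrs["class"]))
```

Printing both dictionaries shows why the original call could only ever match one of the two classes.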
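
The Page 84 erratum quotes the regular expression used in getLinks, which is why the missing "import re" matters. As a rough illustration of what that pattern accepts, here is a small sketch using only the standard library; the sample paths are assumptions chosen for the example, not taken from the book.

```python
import re

# Pattern quoted in the Page 84 erratum: paths that start with /wiki/
# and contain no colon anywhere after that prefix.
article_link = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical sample hrefs, for illustration only.
samples = ["/wiki/Kevin_Bacon", "/wiki/Category:Actors", "/w/index.php"]
matches = [href for href in samples if article_link.match(href)]
print(matches)
```

Without the import, the call to re.compile raises a NameError as soon as getLinks runs.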
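
The Page 113 errata turn on the fact that a list has no items() method, so the n-gram function has to return a dictionary of counts before it can be sorted into an OrderedDict. The sketch below shows one way that could look. The function name get_ngrams and the sample sentence are assumptions for illustration; this is not the author's code from the book or the repository.

```python
from collections import OrderedDict

def get_ngrams(words, n):
    """Count each n-gram, keyed by the n words joined with spaces."""
    counts = {}
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

# Hypothetical input text, for illustration only.
words = "the Python Software Foundation and the Python Software license".split()
ngrams = get_ngrams(words, 2)

# Sorting works because a dict has items(); a plain list would raise
# AttributeError at this point.
sorted_ngrams = OrderedDict(
    sorted(ngrams.items(), key=lambda pair: pair[1], reverse=True))
print(sorted_ngrams)
```

Because the keys are strings such as 'Python Software' rather than lists, the printed result has the same shape as the corrected output quoted in the errata.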