Errata for Applied Text Analysis with Python

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
chapter 4
first code example

The class definition needs a colon at the end in order to compile.

class Estimator(BaseEstimator)

should be

class Estimator(BaseEstimator):
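
The effect of the missing colon can be checked without installing anything. The following is a minimal sketch, not the book's code: the two source strings and the helper name `parses` are illustrative assumptions, and `BaseEstimator` appears only as a name inside the strings, so nothing is imported from scikit-learn.

```python
import ast

# Illustrative source strings (assumptions, not the book's full listing).
# BaseEstimator is only a name here; parsing does not resolve it.
broken = "class Estimator(BaseEstimator)\n    pass\n"
fixed = "class Estimator(BaseEstimator):\n    pass\n"


def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


broken_ok = parses(broken)  # class statement without the trailing colon
fixed_ok = parses(fixed)    # class statement with the trailing colon
```

Because the check is purely syntactic, it mirrors what the interpreter does before any code in the chapter's example runs.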

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 13, 2018 
chapter 6
last sentence of last paragraph before conclusion

In " without any addition changes needed:"

replace "addition" with "additional"

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 09, 2018 
chapter 6
7th paragraph after Latent Dirichlet Allocation (LDA) heading

"Under our TopicModels class"

There is no class with the exact name "TopicModels" in any of the code shown. You likely mean "SklearnTopicModels".

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 09, 2018 
chapter 6
8th paragraph after Latent Dirichlet Allocation (LDA) heading

"each of topic" is not grammatical.

Better replacements:
"each topic"
OR
"each of the topics"

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 09, 2018 
chapter 6
paragraph before the classify method code listing

The following sentence is ungrammatical:
"Inside our KMeansTopics class, we’ll add a classify() method that takes as an argument a document from the corpus and using the classify method from NLTK’s KMeansClusterer class to accomplish this."

The reason it is ungrammatical is that you conjoined two verb phrases, but the tenses don't match.
The main problem is the word "using".

You have to replace "using" with "uses" if you mean the classify method uses the classify method.
You have to replace "using" with "use" if you mean we will use the classify method.

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 09, 2018 
chapter 6
In a paragraph far into the K-MEANS AND MINIBATCH K-MEANS section

Here is the problem paragraph: "Inside our KMeansTopics class, we’ll create a vectorize method, which takes a document (a list of (tag, token) tuples, vectorizes it using one-hot encoding, and returns a NumPy array representation of the document."

The problems with the text are:
1. The text claims that the tuples are (tag, token) when they really are (token, tag). The order of elements in the text is the opposite of the order in the code.

2. A closing parenthesis is missing. The sentence has two opening parentheses but only one closing parenthesis.

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 09, 2018 
chapter 6
in the paragraph before the cluster method code listing (1/3rd way through the chapter)

The following sentence conjugates the verb "use" incorrectly. The subject of "use" is "the algorithm", which is third person singular, so the verb should be conjugated as "uses", with an "s" at the end.

Where the problem is:
"This method will initialize the clustering algorithm with our k as well as two hyperparameters that specify that the algorithm use cosine distance and avoid a result that has any clusters that contain no documents."

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 09, 2018 
Chapter 6
second paragraph after K-MEANS AND MINIBATCH K-MEANS heading

"is convenient place to start"
this noun phrase needs a determiner.
I.e. add the word "a" between "is" and "convenient"
"is a convenient place to start"

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 09, 2018 
chapter 6
Section: distance metrics: paragraph under Venn Diagram figure

"There are multiple implementations of edit distance, all variations on Levenstein distance, but which exact differing penalties for insertions, deletions and substitutions, as well as potentially increased penalties for gaps and transpositions."

The sentence is not a complete sentence, because the clause starting with "which" is missing a verb.

The sentence makes sense if you replace the word "which" with the word "with".
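
To illustrate what "differing penalties" means in the quoted sentence, here is a minimal sketch of a weighted Levenshtein distance. It is not taken from the book: the function name `edit_distance` and the parameters `ins_cost`, `del_cost`, and `sub_cost` are assumptions chosen for this example, and gap and transposition penalties are left out.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Weighted Levenshtein distance between strings a and b.

    The cost parameters are hypothetical knobs for this sketch.
    """
    # prev[j] is the cost of turning the already-processed prefix of a into b[:j].
    # Before any character of a is processed, that means j insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                           # delete ca
                curr[j - 1] + ins_cost,                       # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


unit = edit_distance("kitten", "sitting")                  # all penalties equal to 1
weighted = edit_distance("kitten", "sitting", sub_cost=2)  # substitutions cost double
```

With unit penalties the distance between "kitten" and "sitting" is 3. Doubling the substitution penalty raises it to 5, which shows how the choice of penalties changes the resulting metric.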

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 09, 2018 
chapter 5
conclusion

The following sentence is ungrammatical:
"we will discuss the another prominent, although different, use of machine learning on text"

because it has the word "the" followed by "another". Remove "the".
Replace "the another" with "another".

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
Chapter 5
Conclusion

The subject is missing from the sentence that starts with "Additionally, learned about named".
Put the word "we" before "learned" to fix it.

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 08, 2018 
chapter 5
3rd paragraph of "Building the Training Data"

The subject is missing from the beginning of the sentence:
"Next, will create an extract_entities function"

A good rewrite would put the word "we" before "will".

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
chapter 5
paragraph that starts "The precision of a class, A" (in the exact middle of the chapter)

1. Overuse of monospace font for whole sentences.
2. Unclear notation, specifically in the difference between
A+
+A+
and
+¬A+

I think "under balanced" is not the best choice of phrase, perhaps "underrepresented" is better.

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
Chapter 5,
Model construction: third paragraph: first sentence

"it's" should be "its" (i.e. without the apostrophe) in "as it’s first argument"

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 08, 2018 
chapter 5,
first paragraph of "Model Construction" section: second last sentence

"and a Support Vector Machines."


"Support Vector Machines" is a plural noun, and "a" is a determiner that does not take a plural noun. The noun phrase is ungrammatical.

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
chapter 5
paragraph before heading titled "Cross-Validation"

In the text
"simply from it’s text"

Remove the apostrophe from "it's".

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 08, 2018 
chapter 5
2nd paragraph, third sentence

The sentence containing the text
"may have have classifiers"
is ungrammatical because of an extra occurrence of the word "have".

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
Chapter 4
Section titled "Building models": third paragraph: first sentence

The following text
"Note the numerous transformers that we include in for feature extraction"

sounds better without the word "in".

Note from the Author or Editor:
This sentence is no longer in the chapter.

Anonymous  Mar 08, 2018 
chapter 4
second last paragraph of Text Normalization

In the text:
"These mtehods"

The second word is misspelled.

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
chapter 4
last paragraph in Corpus Loader section

2 issues in the following text:
"it allows us to load documents from disk and add send them into the Pipeline"

1. "add send" is ungrammatical. Replace it with either "add" or "send" but not both.
2. I'm not sure why pipeline starts with a capital letter.

Note from the Author or Editor:
This section is no longer in the chapter.

Anonymous  Mar 08, 2018 
chapter 4
5th paragraph of Section titled Corpus Loader

In "so that we can easily look up the number folds specified by the loader".
Replace "number folds"
with "number of folds"

Note from the Author or Editor:
This section is no longer in the chapter.

Anonymous  Mar 08, 2018 
Chapter 4
last paragraph of Benefits and Limits of Vector Encoding

"a pure bag-of-words model works about 85% of time"

In the above sentence, replace "of time" with "of the time", to make it sound more idiomatic.

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
chapter 4
last paragraph of Benefits and Limits of Vector Encoding


Quoted text from chapter 4: "Nonetheless, using simple models that consider only of word frequencies will often be very successful;"

The problem with the sentence above is that "consider of" is not grammatical. The word "consider" does not take "of" as a complement.

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
chapter 4
in a code example

In a code example in chapter 4 (approximately halfway through the chapter), "Dictionary" is misspelled as "Dictionar", that is, the final "y" is missing. The code will not work because gensim.corpora has no attribute called "Dictionar".
lexicon = gensim.corpora.Dictionar.load_from_text('lexicon.txt')

Note from the Author or Editor:
This has been corrected.

Anonymous  Mar 08, 2018 
chapter 1
chapter 1: The Data Product Pipeline: 2nd paragraph = under Figure 1-2

In the text "does not differ sign from the pipeline", replace the word "sign" with "significantly". I believe this is a typo.

Note from the Author or Editor:
This has already been corrected.

Anonymous  Mar 01, 2018 
chapter 3
chapter 3: Model selection as search: second paragraph: middle of last sentence

"able leverage" should be "able to leverage" in order for the sentence to be grammatical.

Note from the Author or Editor:
That section is no longer in the chapter.

Anonymous  Mar 01, 2018 
Chapter 2 "Text Acquisition and Ingestion", Title: "APIs: Twitter and Search"

The code snippet that shows how to access the Twitter API reads:

...
users = ["tonyojeda3", "bbengfort", "RebeccaBilbro", "OReillyMedia",
         "datacommunitydc", "dataelixir", "pythonweekly", "KirkDBorne"]

def get_tweets(user_list, tweets=20):
    for user in users:
        ...

I guess it should be "for user in user_list:"
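
A minimal sketch of why the loop variable matters, assuming a stand-in for the API call. The `fetch` parameter, the `fake_fetch` helper, and the returned dictionary are hypothetical and exist only so the example runs without Twitter credentials; the book's actual function body is not reproduced here.

```python
# Module-level list, shortened from the book's snippet.
users = ["tonyojeda3", "bbengfort"]


def fake_fetch(user, count):
    """Hypothetical stand-in for the real Twitter API call."""
    return [f"{user}-{i}" for i in range(count)]


def get_tweets(user_list, tweets=20, fetch=fake_fetch):
    """Collect `tweets` items for each name in `user_list`."""
    results = {}
    # Iterate over the argument. Looping over the global `users` instead
    # would silently ignore whatever list the caller passed in.
    for user in user_list:
        results[user] = fetch(user, tweets)
    return results


timeline = get_tweets(["pythonweekly"], tweets=2)
```

With the corrected loop, only the accounts passed by the caller are fetched, regardless of what the module-level `users` list contains.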

Great book, thank you.

ivan  Aug 13, 2017 
3
6th paragraph in Corpus Disk Structure

According to the docstring, the file opened in the code snippet below should be manifest.json, not README:

import json

# In a custom corpus reader class
def manifest(self):
    """
    Reads and parses the manifest.json file in our corpus if it exists.
    """
    return json.load(self.open("README"))
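
One way the corrected method could look is sketched below, under stated assumptions. `SketchCorpusReader` and its `open()` method are hypothetical stand-ins for the book's corpus reader, and returning `None` when the manifest is absent is one possible reading of "if it exists", not confirmed behavior.

```python
import json
import os
import tempfile


class SketchCorpusReader:
    """Hypothetical stand-in for a custom corpus reader rooted at a directory."""

    def __init__(self, root):
        self.root = root

    def open(self, name):
        # Stand-in for the reader's own open(); resolves names against root.
        return open(os.path.join(self.root, name), "r", encoding="utf-8")

    def manifest(self):
        """
        Reads and parses the manifest.json file in our corpus if it exists.
        """
        # Assumption: a missing manifest yields None rather than an error.
        if not os.path.exists(os.path.join(self.root, "manifest.json")):
            return None
        with self.open("manifest.json") as f:
            return json.load(f)


# Usage: a throwaway corpus directory that holds only a manifest.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "manifest.json"), "w", encoding="utf-8") as f:
        json.dump({"name": "demo"}, f)
    loaded = SketchCorpusReader(root).manifest()

# Usage: an empty corpus directory with no manifest at all.
with tempfile.TemporaryDirectory() as empty_root:
    missing = SketchCorpusReader(empty_root).manifest()
```

Opening the file by the name given in the docstring keeps the code and its documentation in agreement, which is the point of the reported erratum.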

Deepak Mahendrakar  Jun 22, 2017