Errata for Practical Natural Language Processing

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version | Location | Description | Submitted by | Date submitted
Printed Page 18
2nd

Bad citation/reference

Now:
Interested readers can look at software libraries such as pregex [11]. Last accessed June 15, 2020.

Should be:

Interested readers can look at software libraries such as pregex [11].

And in p. 34:
[11] Hewitt, Luke. Probabilistic regular expressions (https://oreil.ly/BqhJX), (Github repo). Last accessed June 15, 2020.

Arthur Mauricio Delgadillo  Mar 31, 2021 
Printed Page 24
Last paragraph

"More details on the usage CNNs for NLP can be found in [25] and [26]."

SHOULD BE -->>>

"More details on the usage of CNNs for NLP can be found in [25] and [26]."

Missing the word "of".

Tyler Procko  Nov 09, 2022 
PDF Page 25
First paragraph

Supplemental material (code examples, exercises, etc.) is available for download at
https://oreil.ly/PracticalNLP.

PS

The archive is corrupted; it's impossible to unzip the files.
Please help.

armand  Oct 04, 2021 
Printed Page 26
First full paragraph

The text states that transformers are pretrained on over 40 GB worth of data from all over the internet. I think that must be an error. 40 GB seems way too small.

JoAnn Alvarez  Mar 28, 2021 
Other Digital Version 66
python code

Probably user error, but when using the code:
html = urlopen(myurl).read()
I get a 403 error (access denied).
When I replace that code with the following, everything afterwards works.

import requests
from bs4 import BeautifulSoup

html = requests.get(myurl).text                 # query the website so that it returns an HTML page
#html = urlopen(myurl).read()                   # original call, which returned the 403 here
soupified = BeautifulSoup(html, 'html.parser')  # parse the HTML in the 'html' variable into Beautiful Soup format
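
A likely cause of the 403 (an assumption, not verified against the site in question) is that the default urllib User-Agent is blocked, so keeping urlopen but sending a browser-like header is an alternative to switching to requests. A minimal sketch, with myurl as defined earlier in the notebook:

from urllib.request import Request, urlopen

# Send a browser-like User-Agent; many sites return 403 for the default Python one.
req = Request(myurl, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

The equivalent call with requests would be requests.get(myurl, headers={"User-Agent": "Mozilla/5.0"}).text.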

Anonymous  Nov 22, 2023 
Printed Page 68
4th paragraph

summarization is subjective

Now:
For some cases, like machine translation or summarization, it's not always possible to automate evaluation since comparison is not subjective.

Actual:
For some cases, like machine translation or summarization, it's not always possible to automate evaluation since comparison is subjective.

Or

For some cases, like machine translation or summarization, it's not always possible to automate evaluation since comparison is not objective.

Arthur Mauricio Delgadillo  Apr 01, 2021 
Printed, PDF, ePub Page 69
Table 2-2 2nd and 3rd row, in description column, after i.e.

Precision [48] Shows how precise or exact the model’s predictions are, i.e., given all the positive (the class we care about) cases, how many can the model classify correctly?

Recall [48] Recall is complementary to precision. It captures how well the model can recall positive class, i.e., given all the positive predictions it makes, how many of them are indeed positive?

->
Precision [48] Shows how precise or exact the model’s predictions are, i.e., given all the positive predictions it makes, how many of them are indeed positive?

Recall [48] Recall is complementary to precision. It captures how well the model can recall positive class, i.e., given all the positive (the class we care about) cases, how many can the model classify correctly?
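
A quick numeric check of the corrected wording (hypothetical labels, using scikit-learn only to confirm the formulas): precision is computed over the model's positive predictions, recall over the actual positive cases.

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # 4 actual positives
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 2 true positives, 1 false positive, 2 false negatives

# Precision: of all positive predictions, how many are indeed positive? TP / (TP + FP) = 2/3
print(precision_score(y_true, y_pred))   # 0.666...
# Recall: of all actual positive cases, how many does the model find? TP / (TP + FN) = 2/4
print(recall_score(y_true, y_pred))      # 0.5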

Anyway, this is the best book I've ever read in my life for the practical application of ML in industry. I highly appreciate this work and hope to contribute :)

Jin Tao  Jul 04, 2020 
Printed Page 74
Last paragraph

Typo "cbased"

Now:
Uber operates in 400+ cities worldwide, and cbased on ...

Should be:
Uber operates in 400+ cities worldwide, and based on ...

Arthur Mauricio Delgadillo  Apr 01, 2021 
Printed Page 91
Table 3-2

The TF score of dog in the corpus is listed as 0.33. However, the total number of terms in the corpus is 12, with dog occurring 3 times. The TF score should therefore be 0.25. The same holds for man.

Anonymous  Dec 31, 2021 
Printed Page 91
Table 3-2

Table 3-2 regarding TF-IDF appears to contain incorrect TF values for "D1" (document 1), "Dog bites man". I thought the TF score is relative to the specific document being analyzed. Document 1 only contains 3 words in total, so shouldn't the denominator of each TF score be 3 (with all numerators either 1 or 0, since there are no repeated words in document 1)? That would make dog 1/3 (correctly listed), bites 1/3 (incorrectly stated as 1/6), man 1/3 (correctly listed as 0.33), eats 0/3 (incorrectly listed as 0.17), meat 0/3 (or 0/12, incorrectly listed as 1/12 = 0.083), and food 0/3 (incorrectly listed as 0.083). Moreover, the fractions used in the IDF scores appear accurate relative to the four documents of the toy corpus in Table 3-1 on p. 85, but it appears log base 2 without adjustment is used for easier calculation (deviating from the formula stated above). Together, the overall calculation of the TF-IDF scores in the table needs adjustment.

Anonymous  Feb 05, 2022 
Printed Page 91
TF-IDF paragraph

The TF-IDF example reports a wrong calculation for the words "dog" and "man". They both appear 3 times out of 12 terms, so the TF is 3/12 = 1/4 = 0.25, not 1/3. So the TF-IDF calculations are wrong as well; they should be 0.4114 * 0.25 = 0.10285.
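
For reference, a small sketch contrasting per-document TF with corpus-level frequency for "dog" (assuming the four-document toy corpus of Table 3-1: "Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"), which shows where both 1/3 = 0.33 and 3/12 = 0.25 come from:

# Assumed toy corpus from Table 3-1 (lowercased).
corpus = {
    "D1": "dog bites man",
    "D2": "man bites dog",
    "D3": "dog eats meat",
    "D4": "man eats food",
}

# Per-document TF: occurrences in one document / terms in that document.
for doc_id, text in corpus.items():
    tokens = text.split()
    print(doc_id, tokens.count("dog") / len(tokens))   # D1, D2, D3: 1/3 = 0.33; D4: 0.0

# Corpus-level frequency: occurrences across all documents / all terms in the corpus.
all_tokens = " ".join(corpus.values()).split()
print(all_tokens.count("dog") / len(all_tokens))       # 3/12 = 0.25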

Luciano  Oct 24, 2022 
Other Digital Version 96
jupyter notebook mentioned at the top of the page

This is an error with one of the Jupyter notebooks, not an actual typo in the text. I wrote to O'Reilly support and they routed me to this website for getting help.

When running the "06_Training_embeddings_using_gensim" notebook for Chapter 3, I get an "OSError: Invalid data stream".

Below you'll find the code I run. It is straightforward from the notebook. I've also tried adding "processes=4" and changing the value to 1 or 1. I have the same libraries the code requires as well.
There is no documentation of this error on Stack Overflow either. Any help will be highly appreciated.
Norma

#Preparing the Training data
wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_14660\639866601.py in <module>
1 #Preparing the Training data
2 wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
----> 3 sentences = list(wiki.get_texts())
4
5 #if you get a memory error executing the lines above

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in get_texts(self)
673 # process the corpus in smaller chunks of docs, because multiprocessing.Pool
674 # is dumb and would load the entire input into RAM at once...
--> 675 for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
676 for tokens, title, pageid in pool.imap(_process_article, group):
677 articles_all += 1

~\anaconda3\lib\site-packages\gensim\utils.py in chunkize(corpus, chunksize, maxsize, as_numpy)
1232
1233 """
-> 1234 for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
1235 yield chunk
1236 else:

~\anaconda3\lib\site-packages\gensim\utils.py in chunkize_serial(iterable, chunksize, as_numpy, dtype)
1147 wrapped_chunk = [[np.array(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]
1148 else:
-> 1149 wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
1150 if not wrapped_chunk[0]:
1151 break

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in <genexpr>(.0)
665 tokenization_params = (self.tokenizer_func, self.token_min_len, self.token_max_len, self.lower)
666 texts = \
--> 667 ((text, self.lemmatize, title, pageid, tokenization_params)
668 for title, text, pageid
669 in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces, self.filter_articles))

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in extract_pages(f, filter_namespaces, filter_articles)
408 # those from the first element we find, which will be part of the metadata,
409 # and construct element paths.
--> 410 elem = next(elems)
411 namespace = get_namespace(elem.tag)
412 ns_mapping = {"ns": namespace}

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in <genexpr>(.0)
402
403 """
--> 404 elems = (elem for _, elem in iterparse(f, events=("end",)))
405
406 # We can't rely on the namespace for database dumps, since it's changed

~\anaconda3\lib\xml\etree\ElementTree.py in iterator()
1222 yield from pullparser.read_events()
1223 # load event buffer
-> 1224 data = source.read(16 * 1024)
1225 if not data:
1226 break

~\anaconda3\lib\bz2.py in read(self, size)
176 with self._lock:
177 self._check_can_read()
--> 178 return self._buffer.read(size)
179
180 def read1(self, size=-1):

~\anaconda3\lib\_compression.py in readinto(self, b)
66 def readinto(self, b):
67 with memoryview(b) as view, view.cast("B") as byte_view:
---> 68 data = self.read(len(byte_view))
69 byte_view[:len(data)] = data
70 return len(data)

~\anaconda3\lib\_compression.py in read(self, size)
101 else:
102 rawblock = b""
--> 103 data = self._decompressor.decompress(rawblock, size)
104 if data:
105 break

OSError: Invalid data stream
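
The traceback bottoms out in Python's bz2 decompressor, so one plausible cause (an assumption, not a confirmed fix) is a truncated or corrupted Wikipedia dump file rather than the notebook code itself. A minimal sketch to check whether the dump referenced by file_name decompresses end to end before handing it to WikiCorpus:

import bz2

# If this loop raises the same OSError, the .xml.bz2 dump is damaged and should be re-downloaded.
with bz2.open(file_name, "rb") as f:
    while f.read(1024 * 1024):   # read in 1 MB chunks until EOF
        pass
print("dump decompressed cleanly")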

Norma Grubb  Apr 18, 2022 
Printed Page 109
2nd paragraph

Original: "Here, the images are passed through a convolution neural network and the final feature vectors."

Suggestion 1: Change "convolution" to "convolutional".
Suggestion 2: Change "and" to "for"?

Anonymous  Feb 05, 2022 
Printed Page 110
Figure 3-15

Original: " ( eft: numbers, right: job titles [33] "

Suggestion: "(left: numbers, right: job titles) [33]"

Anonymous  Feb 05, 2022 
PDF Page 118
2



Small error:
One word has not been translated into French.

J’ai mangé trois filberts => J’ai mangé trois noisettes

Congratulations on your book.
A French guy ;)

Anonymous  Jul 24, 2020 
Printed Page 139
6th sentence

The sentence is not complete. It ends with 'and multilabel classification problems using .'

Anonymous  Jan 17, 2022 
PDF Page 257
code block

I am really thankful to O'Reilly for providing the book Practical Natural Language Processing. This is really an amazing practical NLP book.
I have followed the code supplements of the Text Classification chapter. In that chapter word2vec is implemented with a pre-trained model, but I am having trouble training my own word2vec model with a train/test split. I want to build a sentiment analysis model for Twitter data related to COVID-19.
Let me explain my problem in detail. Suppose we are implementing Bag of Words with the following code snippet:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
vect = CountVectorizer(preprocessor=clean)   # 'clean' is the preprocessing function defined earlier
X_train_dtm = vect.fit_transform(X_train)    # fit the vectorizer on the training data only
X_test_dtm = vect.transform(X_test)          # reuse the fitted vectorizer on the test data
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

Here, after splitting the data set into training and testing sets, we fit the vectorizer only on the training data and then use that fit to transform both the training and the testing data. Now suppose I want to do this with word2vec. Should I train the word2vec model only on X_train, with code like the following?
our_model = Word2Vec(X_train, size=10, window=5, min_count=1, workers=4)
But my question is: how would I transform the test data with this vectorization? There is no '.transform' method in Gensim as there is in scikit-learn. (For Doc2vec there is a method called 'infer_vector'.)
Another question: after vectorizing the whole corpus with Gensim's Word2vec, how can I get the complete vector-space representation of the whole vocabulary? Is there a method for that in Gensim's Word2vec?
It would be very helpful if you could provide a Python implementation for training our own embeddings with word2vec after a train/test split.
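
One common pattern (a sketch under stated assumptions, not the book's own notebook code) is to train Word2Vec on the training texts only and then represent any document, train or test, by averaging the vectors of its in-vocabulary words; Gensim has no .transform, but model.wv.index_to_key and model.wv.vectors expose the full vocabulary and its embedding matrix. Note that with gensim 4.x the constructor parameter is vector_size rather than size. Assuming X_train and X_test are lists of token lists:

import numpy as np
from gensim.models import Word2Vec

# Train embeddings on the training split only.
our_model = Word2Vec(X_train, vector_size=100, window=5, min_count=1, workers=4)

def embed(tokens, model):
    # Average the vectors of the words the model knows; zero vector if none are known.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train_vecs = np.array([embed(doc, our_model) for doc in X_train])
X_test_vecs = np.array([embed(doc, our_model) for doc in X_test])   # words unseen in training are simply skipped

# Complete vector-space representation of the vocabulary:
vocab = our_model.wv.index_to_key    # all words known to the model
matrix = our_model.wv.vectors        # numpy array with one row per word in vocab

MultinomialNB rejects the negative values these dense vectors contain, so a classifier such as LogisticRegression is the usual pairing for averaged embeddings.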

Anonymous  Mar 15, 2021 
Printed Page 299
First line of the page

The page starts with “but they’re not not Twitter or SMTD specific.” I think not is duplicated and it should only appear once.

Rodrigo Ávila  Jan 09, 2023