Errata for Practical Natural Language Processing

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version | Location | Description | Submitted by | Date submitted
Printed Page 18
2nd

Bad citation/reference

Now:
Interested readers can look at software libraries such as pregex [11]. Last accessed June 15, 2020.

Should be:

Interested readers can look at software libraries such as pregex [11].

And in p. 34:
[11] Hewitt, Luke. Probabilistic regular expressions (https://oreil.ly/BqhJX), (Github repo). Last accessed June 15, 2020.

Arthur Mauricio Delgadillo  Mar 31, 2021 
Printed Page 24
Last paragraph

"More details on the usage CNNs for NLP can be found in [25] and [26]."

SHOULD BE -->>>

"More details on the usage of CNNs for NLP can be found in [25] and [26]."

Missing the word "of".

Tyler Procko  Nov 09, 2022 
PDF Page 25
First paragraph

Supplemental material (code examples, exercises, etc.) is available for download at
https://oreil.ly/PracticalNLP.

PS

The archive is corrupted; it's impossible to unzip the files.
Please help.

armand  Oct 04, 2021 
Printed Page 26
First full paragraph

The text states that transformers are pretrained on over 40 GB worth of data from all over the internet. I think that must be an error. 40 GB seems way too small.

JoAnn Alvarez  Mar 28, 2021 
Other Digital Version 66
python code

Probably user error, but when using the code:
html = urlopen(myurl).read()
I get a 403 error (access denied).
When I replace that code with the following, everything afterwards works.

import requests
from bs4 import BeautifulSoup

html = requests.get(myurl).text                 # query the website so that it returns an HTML page
#html = urlopen(myurl).read()                   # original call, which returned the 403 here
soupified = BeautifulSoup(html, 'html.parser')  # parse the HTML in the 'html' variable into Beautiful Soup format
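
A likely cause of the 403 (an assumption, not verified against the site in question) is that the default urllib User-Agent is blocked, so keeping urlopen but sending a browser-like header is an alternative to switching to requests. A minimal sketch, with myurl as defined earlier in the notebook:

from urllib.request import Request, urlopen

# Send a browser-like User-Agent; many sites return 403 for the default Python one.
req = Request(myurl, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

The equivalent call with requests would be requests.get(myurl, headers={"User-Agent": "Mozilla/5.0"}).text.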

Anonymous  Nov 22, 2023 
Printed Page 68
4th paragraph

summarization is subjective

Now:
For some cases, like machine translation or summarization, it's not always possible to automate evaluation since comparison is not subjective.

Actual:
For some cases, like machine translation or summarization, it's not always possible to automate evaluation since comparison is subjective.

Or

For some cases, like machine translation or summarization, it's not always possible to automate evaluation since comparison is not objective.

Arthur Mauricio Delgadillo  Apr 01, 2021 
Printed, PDF, ePub Page 69
Table 2-2 2nd and 3rd row, in description column, after i.e.

Precision [48] Shows how precise or exact the model’s predictions are, i.e., given all the positive (the class we care about) cases, how many can the model classify correctly?

Recall [48] Recall is complementary to precision. It captures how well the model can recall positive class, i.e., given all the positive predictions it makes, how many of them are indeed positive?

->
Precision [48] Shows how precise or exact the model’s predictions are, i.e., given all the positive predictions it makes, how many of them are indeed positive?

Recall [48] Recall is complementary to precision. It captures how well the model can recall positive class, i.e., given all the positive (the class we care about) cases, how many can the model classify correctly?
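
A quick numeric check of the corrected wording (hypothetical labels, using scikit-learn only to confirm the formulas): precision is computed over the model's positive predictions, recall over the actual positive cases.

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # 4 actual positives
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 2 true positives, 1 false positive, 2 false negatives

# Precision: of all positive predictions, how many are indeed positive? TP / (TP + FP) = 2/3
print(precision_score(y_true, y_pred))   # 0.666...
# Recall: of all actual positive cases, how many does the model find? TP / (TP + FN) = 2/4
print(recall_score(y_true, y_pred))      # 0.5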

Anyway, this is the best book I've ever read in my life for the practical application of ML in industry. I highly appreciate this work and hope to contribute :)

Jin Tao  Jul 04, 2020 
Printed Page 74
Last paragraph

Typo "cbased"

Now:
Uber operates in 400+ cities worldwide, and cbased on ...

Should be:
Uber operates in 400+ cities worldwide, and based on ...

Arthur Mauricio Delgadillo  Apr 01, 2021 
Printed Page 91
Table 3-2

The TF score of dog in the corpus is listed as 0.33. However, the total number of terms in the corpus is 12, with dog occurring 3 times. The TF score should therefore be 0.25. The same holds for man.

Anonymous  Dec 31, 2021 
Printed Page 91
Table 3-2

Table 3-2 regarding TF-IDF appears to contain incorrect TF values for "D1" (document 1), "Dog bites man". I thought the TF score is relative to the specific document being analyzed. Document 1 only contains 3 words in total, so shouldn't the denominator of each TF score be 3 (with all numerators either 1 or 0, since there are no repeated words in document 1)? That would make dog 1/3 (correctly listed), bites 1/3 (incorrectly stated as 1/6), man 1/3 (correctly listed as 0.33), eats 0/3 (incorrectly listed as 0.17), meat 0/3 (or 0/12, incorrectly listed as 1/12 = 0.083), and food 0/3 (incorrectly listed as 0.083). Moreover, the fractions used in the IDF scores appear accurate relative to the four documents of the toy corpus in Table 3-1 on p. 85, but it appears log base 2 without adjustment is used for easier calculation (deviating from the formula stated above). Together, the overall calculation of the TF-IDF scores in the table needs adjustment.

Anonymous  Feb 05, 2022 
Printed Page 91
TF-IDF paragraph

The TF-IDF example reports a wrong calculation for the words "dog" and "man". They both appear 3 times out of 12 terms, so the TF is 3/12 = 1/4 = 0.25, not 1/3. So the TF-IDF calculations are wrong as well; they should be 0.4114 * 0.25 = 0.10285.
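
For reference, a small sketch contrasting per-document TF with corpus-level frequency for "dog" (assuming the four-document toy corpus of Table 3-1: "Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"), which shows where both 1/3 = 0.33 and 3/12 = 0.25 come from:

# Assumed toy corpus from Table 3-1 (lowercased).
corpus = {
    "D1": "dog bites man",
    "D2": "man bites dog",
    "D3": "dog eats meat",
    "D4": "man eats food",
}

# Per-document TF: occurrences in one document / terms in that document.
for doc_id, text in corpus.items():
    tokens = text.split()
    print(doc_id, tokens.count("dog") / len(tokens))   # D1, D2, D3: 1/3 = 0.33; D4: 0.0

# Corpus-level frequency: occurrences across all documents / all terms in the corpus.
all_tokens = " ".join(corpus.values()).split()
print(all_tokens.count("dog") / len(all_tokens))       # 3/12 = 0.25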

Luciano  Oct 24, 2022 
Other Digital Version 96
jupyter notebook mentioned at the top of the page

This is an error with one of the Jupyter notebooks, not an actual typo in the text. I wrote to O'Reilly support and they routed me to this website for getting help.

When running the "06_Training_embeddings_using_gensim" notebook for Chapter 3, I get an "OSError: Invalid data stream".

Below you'll find the code I run. It is straightforward from the notebook. I've also tried adding "processes=4" and changing the value to 1 or 1. I have the same libraries the code requires as well.
There is no documentation of this error on Stack Overflow either. Any help will be highly appreciated.
Norma

#Preparing the Training data
wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_14660\639866601.py in <module>
1 #Preparing the Training data
2 wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
----> 3 sentences = list(wiki.get_texts())
4
5 #if you get a memory error executing the lines above

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in get_texts(self)
673 # process the corpus in smaller chunks of docs, because multiprocessing.Pool
674 # is dumb and would load the entire input into RAM at once...
--> 675 for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
676 for tokens, title, pageid in pool.imap(_process_article, group):
677 articles_all += 1

~\anaconda3\lib\site-packages\gensim\utils.py in chunkize(corpus, chunksize, maxsize, as_numpy)
1232
1233 """
-> 1234 for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
1235 yield chunk
1236 else:

~\anaconda3\lib\site-packages\gensim\utils.py in chunkize_serial(iterable, chunksize, as_numpy, dtype)
1147 wrapped_chunk = [[np.array(doc, dtype=dtype) for doc in itertools.islice(it, int(chunksize))]]
1148 else:
-> 1149 wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
1150 if not wrapped_chunk[0]:
1151 break

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in <genexpr>(.0)
665 tokenization_params = (self.tokenizer_func, self.token_min_len, self.token_max_len, self.lower)
666 texts = \
--> 667 ((text, self.lemmatize, title, pageid, tokenization_params)
668 for title, text, pageid
669 in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces, self.filter_articles))

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in extract_pages(f, filter_namespaces, filter_articles)
408 # those from the first element we find, which will be part of the metadata,
409 # and construct element paths.
--> 410 elem = next(elems)
411 namespace = get_namespace(elem.tag)
412 ns_mapping = {"ns": namespace}

~\anaconda3\lib\site-packages\gensim\corpora\wikicorpus.py in <genexpr>(.0)
402
403 """
--> 404 elems = (elem for _, elem in iterparse(f, events=("end",)))
405
406 # We can't rely on the namespace for database dumps, since it's changed

~\anaconda3\lib\xml\etree\ElementTree.py in iterator()
1222 yield from pullparser.read_events()
1223 # load event buffer
-> 1224 data = source.read(16 * 1024)
1225 if not data:
1226 break

~\anaconda3\lib\bz2.py in read(self, size)
176 with self._lock:
177 self._check_can_read()
--> 178 return self._buffer.read(size)
179
180 def read1(self, size=-1):

~\anaconda3\lib\_compression.py in readinto(self, b)
66 def readinto(self, b):
67 with memoryview(b) as view, view.cast("B") as byte_view:
---> 68 data = self.read(len(byte_view))
69 byte_view[:len(data)] = data
70 return len(data)

~\anaconda3\lib\_compression.py in read(self, size)
101 else:
102 rawblock = b""
--> 103 data = self._decompressor.decompress(rawblock, size)
104 if data:
105 break

OSError: Invalid data stream
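
The traceback bottoms out in Python's bz2 decompressor, so one plausible cause (an assumption, not a confirmed fix) is a truncated or corrupted Wikipedia dump file rather than the notebook code itself. A minimal sketch to check whether the dump referenced by file_name decompresses end to end before handing it to WikiCorpus:

import bz2

# If this loop raises the same OSError, the .xml.bz2 dump is damaged and should be re-downloaded.
with bz2.open(file_name, "rb") as f:
    while f.read(1024 * 1024):   # read in 1 MB chunks until EOF
        pass
print("dump decompressed cleanly")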

Norma Grubb  Apr 18, 2022 
Printed Page 109
2nd paragraph

Original: "Here, the images are passed through a convolution neural network and the final feature vectors."

Suggestion 1: Change "convolution" to "convolutional".
Suggestion 2: Change "and" to "for"?

Anonymous  Feb 05, 2022 
Printed Page 110
Figure 3-15

Original: " ( eft: numbers, right: job titles [33] "

Suggestion: "(left: numbers, right: job titles) [33]"

Anonymous  Feb 05, 2022 
PDF Page 118
2



Small error:
One word has not been translated into French.

J’ai mangé trois filberts => J’ai mangé trois noisettes

Congratulations on your book.
A French guy ;)

Anonymous  Jul 24, 2020 
Printed Page 139
6th sentence

The sentence is not complete. It ends with 'and multilabel classification problems using .'

Anonymous  Jan 17, 2022 
PDF Page 257
code block

I am really thankful to O'Reilly for providing the book Practical Natural Language Processing. This is really an amazing practical NLP book.
I have followed the code supplements of the Text Classification chapter. In that chapter word2vec is implemented with a pre-trained model, but I am having trouble training my own word2vec model with a train/test split. I want to build a sentiment analysis model for Twitter data related to COVID-19.
Let me explain my problem in detail. Suppose we are implementing Bag of Words with the following code snippet:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
vect = CountVectorizer(preprocessor=clean)   # 'clean' is the preprocessing function defined earlier
X_train_dtm = vect.fit_transform(X_train)    # fit the vectorizer on the training data only
X_test_dtm = vect.transform(X_test)          # reuse the fitted vectorizer on the test data
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

Here, after splitting the data set into training and testing sets, we fit the vectorizer only on the training data and then use that fit to transform both the training and the testing data. Now suppose I want to do this with word2vec. Should I train the word2vec model only on X_train, with code like the following?
our_model = Word2Vec(X_train, size=10, window=5, min_count=1, workers=4)
But my question is: how would I transform the test data with this vectorization? There is no '.transform' method in Gensim as there is in scikit-learn. (For Doc2vec there is a method called 'infer_vector'.)
Another question: after vectorizing the whole corpus with Gensim's Word2vec, how can I get the complete vector-space representation of the whole vocabulary? Is there a method for that in Gensim's Word2vec?
It would be very helpful if you could provide a Python implementation for training our own embeddings with word2vec after a train/test split.
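
One common pattern (a sketch under stated assumptions, not the book's own notebook code) is to train Word2Vec on the training texts only and then represent any document, train or test, by averaging the vectors of its in-vocabulary words; Gensim has no .transform, but model.wv.index_to_key and model.wv.vectors expose the full vocabulary and its embedding matrix. Note that with gensim 4.x the constructor parameter is vector_size rather than size. Assuming X_train and X_test are lists of token lists:

import numpy as np
from gensim.models import Word2Vec

# Train embeddings on the training split only.
our_model = Word2Vec(X_train, vector_size=100, window=5, min_count=1, workers=4)

def embed(tokens, model):
    # Average the vectors of the words the model knows; zero vector if none are known.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train_vecs = np.array([embed(doc, our_model) for doc in X_train])
X_test_vecs = np.array([embed(doc, our_model) for doc in X_test])   # words unseen in training are simply skipped

# Complete vector-space representation of the vocabulary:
vocab = our_model.wv.index_to_key    # all words known to the model
matrix = our_model.wv.vectors        # numpy array with one row per word in vocab

MultinomialNB rejects the negative values these dense vectors contain, so a classifier such as LogisticRegression is the usual pairing for averaged embeddings.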

Anonymous  Mar 15, 2021 
Printed Page 299
First line of the page

The page starts with “but they’re not not Twitter or SMTD specific.” I think not is duplicated and it should only appear once.

Rodrigo Ávila  Jan 09, 2023