November 2019
Intermediate to advanced
346 pages
9h 36m
English
We often find in data science that the objects we wish to analyze are textual. For example, they might be tweets, articles, or network logs. Since our algorithms require numerical inputs, we must find a way to convert such text into numerical features. To this end, we utilize a sequence of techniques.
A token is a unit of text. For example, we may specify that our tokens are words, sentences, or characters. A count vectorizer takes textual input and then outputs a vector consisting of the counts of the textual tokens. A hashing vectorizer is a variation on the count vectorizer that sets out to be faster and more scalable, at the cost of interpretability and ...