Chapter 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
A bag-of-words representation is simple to generate but far from perfect. If we count all words equally, then some words end up being emphasized more than we need. Recall our example of Emma and the raven from Chapter 3. We’d like a document representation that emphasizes the two main characters. The words “Emma” and “raven” both appear three times, but “the” appears a whopping eight times, “and” appears five times, and “it” and “was” both appear four times. The main characters do not stand out by simple frequency count alone. This is problematic.
It would also be nice to pick out words such as “magnificently,” “gleamed,” “intimidated,” “tentatively,” and “reigned,” because they help to set the overall tone of the paragraph. They indicate sentiment, which can be very valuable information to a data scientist. So, ideally, we’d like a representation that highlights meaningful words.
Tf-Idf : A Simple Twist on Bag-of-Words
Tf-idf is a simple twist on the bag-of-words approach. It stands for term frequency–inverse document frequency. Instead of looking at the raw counts of each word in each document in a dataset, tf-idf looks at a normalized count where each word count is divided by the number of documents this word appears in. That is:
bow(w, d) = # times word w appears in document d
tf-idf(w, d) = bow(w, d) * N / (# documents in which word w appears)
N is the total number of documents in the dataset. The ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access