Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua

Term weighting schemes

In Chapter 4, Obtaining, Processing, and Preparing Data with Spark, we looked at vector representation, where text features are mapped to a simple binary vector, an approach called the bag-of-words model. Another representation commonly used in practice is Term Frequency-Inverse Document Frequency (tf-idf).

tf-idf weights each term in a piece of text (referred to as a document) based on its frequency in that document (the term frequency). A global normalization, called the inverse document frequency, is then applied based on how frequently the term occurs across all documents (the set of documents in a dataset is commonly referred to as a corpus). The standard definition of tf-idf is shown here:

tf-idf(t,d) = tf(t,d) × idf(t)

Here, tf(t,d) ...
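To make the formula concrete, here is a minimal sketch in plain Python (not Spark). It rests on two assumptions that the excerpt above does not spell out: tf(t,d) is taken to be the raw count of term t in document d, and idf(t) is taken to be log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The function name `tf_idf` and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    corpus is a list of documents, each a list of string tokens.
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))

    # Assumed idf definition: log(N / df(t)).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    weights = []
    for doc in corpus:
        # Assumed tf definition: raw count of the term in this document.
        tf = Counter(doc)
        weights.append({term: count * idf[term] for term, count in tf.items()})
    return weights


# Hypothetical toy corpus of three tokenized documents.
corpus = [
    ["spark", "is", "fast"],
    ["spark", "is", "scalable"],
    ["spark", "spark", "mllib"],
]
weights = tf_idf(corpus)
```

Under these assumptions, a term that occurs in every document (here "spark") receives a weight of zero, because log(3 / 3) = 0, while a term confined to a single document (such as "fast" or "mllib") receives the largest idf. This is the global normalization effect described above: terms that are common across the corpus are down-weighted relative to rare ones.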
