In Chapter 4, Obtaining, Processing, and Preparing Data with Spark, we looked at vector representation, where text features are mapped to a simple binary vector called the bag-of-words model. Another representation used commonly in practice is called Term Frequency-Inverse Document Frequency.
tf-idf weights each term in a piece of text (referred to as a document) based on its frequency in the document (the term frequency). A global normalization, called the inverse document frequency, is then applied based on the frequency of this term among all documents (the set of documents in a dataset is commonly referred to as a corpus). The standard definition of tf-idf is shown here:
tf-idf(t,d) = tf(t,d) x idf(t)
Here, tf(t,d) ...