In Chapter 4, *Obtaining, Processing, and Preparing Data with Spark*, we looked at a vector representation in which text features are mapped to a simple binary vector, known as the **bag-of-words** model. Another representation commonly used in practice is Term Frequency-Inverse Document Frequency (tf-idf).

tf-idf weights each term in a piece of text (referred to as a **document**) based on its frequency in the document (the **term frequency**). A global normalization, called the **inverse document frequency**, is then applied based on the frequency of this term among all documents (the set of documents in a dataset is commonly referred to as a **corpus**). The standard definition of tf-idf is shown here:

*tf-idf(t,d) = tf(t,d) × idf(t)*

Here, *tf(t,d) ...*
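
To make the product above concrete, here is a minimal plain-Python sketch that computes tf-idf weights for a tiny, made-up corpus. It rests on assumptions not stated in the text: *tf(t,d)* is taken to be the raw count of term *t* in document *d*, and *idf(t)* is taken to be *log(N / df(t))*, where *N* is the number of documents and *df(t)* is the number of documents containing *t*. This is one common textbook variant; libraries often use smoothed forms instead. The function name `tf_idf` and the sample corpus are purely illustrative.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = raw term count and idf = log(N / df), an unsmoothed variant.
    """
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical three-document corpus, already split into lowercase tokens.
corpus = [
    ["spark", "is", "fast"],
    ["spark", "is", "distributed"],
    ["hadoop", "is", "distributed"],
]
weights = tf_idf(corpus)
```

In this toy corpus, "is" occurs in every document, so its idf is *log(3/3) = 0* and its weight vanishes, whereas "fast" occurs in only one document and receives the largest weight in the first document. This is the down-weighting of common terms that the inverse document frequency is meant to provide.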