Using a TF-IDF model

While we often refer to training a TF-IDF model, it is actually a feature extraction process or transformation rather than a machine learning model. TF-IDF weighting is often used as a preprocessing step for other models, such as dimensionality reduction, classification, or regression.

To illustrate the potential uses of TF-IDF weighting, we will explore two examples. The first is using the TF-IDF vectors to compute document similarity, while the second involves training a multilabel classification model with the TF-IDF vectors as input features.

Document similarity with the 20 Newsgroups dataset and TF-IDF features

You might recall from Chapter 4, Building a Recommendation Engine with Spark, that the similarity between two vectors ...

Get Machine Learning with Spark now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.