Building a Recommendation Engine with Scala by Saleem Ansari

Feature extraction and transformation

In Chapter 2, Data Processing Pipeline Using Scala, we discussed different kinds of data types (continuous, discrete, and so on) and a couple of data cleaning methods. Now it is time to see what else we can do to clean data and extract features from it. Spark provides several approaches for feature extraction and transformation, including TF-IDF, Word2Vec, StandardScaler, Normalizer, and feature selection.

Term frequency-inverse document frequency (TF-IDF)

TF-IDF combines two measures: TF (short for term frequency) and IDF (short for inverse document frequency). It is specifically suited to text documents, where the score is used to determine the discriminating power of a term in a document. In Spark, TF can be calculated using HashingTF, which is ...
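
To make the idea of a term's discriminating power concrete, here is a minimal sketch of a TF-IDF computation in plain Scala. It deliberately does not use Spark or HashingTF. The object name TfIdfSketch, its method names, and the add-one smoothing in the IDF formula are assumptions made for this illustration, not the book's code.

```scala
object TfIdfSketch {
  // Each document is a sequence of already-tokenized terms.
  type Doc = Seq[String]

  // Term frequency: raw count of each term within one document.
  def termFrequencies(doc: Doc): Map[String, Int] =
    doc.groupBy(identity).map { case (term, hits) => term -> hits.size }

  // Inverse document frequency with add-one smoothing (an assumption of this sketch):
  //   idf(t) = ln((N + 1) / (df(t) + 1))
  // where N is the number of documents and df(t) is the number of
  // documents that contain the term t.
  def inverseDocumentFrequencies(corpus: Seq[Doc]): Map[String, Double] = {
    val n = corpus.size
    corpus
      .flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, docs) => term -> math.log((n + 1.0) / (docs.size + 1.0)) }
  }

  // TF-IDF score of every term in every document: tf(t, d) * idf(t).
  def tfIdf(corpus: Seq[Doc]): Seq[Map[String, Double]] = {
    val idf = inverseDocumentFrequencies(corpus)
    corpus.map { doc =>
      termFrequencies(doc).map { case (term, tf) => term -> tf * idf(term) }
    }
  }
}
```

With this formula, a term that occurs in every document gets an IDF of zero and so carries no discriminating power, while a term concentrated in a few documents scores higher in those documents.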
