In Chapter 2, Data Processing Pipeline Using Scala, we discussed different kinds of data types – continuous, discrete, and so on – and a couple of data cleaning methods. Now it is time to see what else we can do to clean data and extract features from it. Spark provides many approaches for feature extraction and transformation: TF-IDF, Word2Vec, StandardScaler, Normalizer, and feature selection.
TF-IDF combines TF (short for term frequency) and IDF (short for inverse document frequency). TF-IDF is specifically suited for text documents: the score measures the discriminating power of a term within a document. In Spark, TF can be calculated using HashingTF, which is ...
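As a sketch of how this fits together in Spark's DataFrame-based `spark.ml` API, the following example tokenizes a toy corpus, hashes terms into frequency vectors with `HashingTF`, and then rescales them with `IDF`. The object name, column names, and the three sample documents are invented for illustration; the local-mode `SparkSession` is for experimentation only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Hypothetical example: computing TF-IDF over a tiny in-memory corpus.
object TfIdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TfIdfExample")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()
    import spark.implicits._

    // A toy corpus: each row is one document.
    val docs = Seq(
      (0, "spark makes big data processing simple"),
      (1, "scala and spark work well together"),
      (2, "feature extraction with tf idf in spark")
    ).toDF("id", "text")

    // Split the raw text into lowercase terms.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val words = tokenizer.transform(docs)

    // HashingTF maps each term to an index by hashing and counts term frequencies.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(1 << 12) // small feature space for a small corpus
    val tf = hashingTF.transform(words)

    // IDF down-weights terms that appear in many documents,
    // boosting the discriminating power of rarer terms.
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
    val tfidf = idfModel.transform(tf)

    tfidf.select("id", "features").show(truncate = false)
    spark.stop()
  }
}
```

Note that `IDF` is an estimator: it must first be `fit` on the corpus to learn document frequencies before it can `transform` the TF vectors, whereas `Tokenizer` and `HashingTF` are stateless transformers.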