Artificial Intelligence for Big Data
by Anand Deshpande, Manish Kumar, Albenzo Coletta, Giancarlo Zaccone
TF-IDF
The TF-IDF method of feature extraction multiplies term frequency (TF) by inverse document frequency (IDF) to compute a numerical weight for each token or term. The weight reflects not only how important a word is within a specific document but also how common that word is across the other documents of the corpus. As a result, words that are overly frequent across the entire corpus are down-weighted.
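To make the weighting concrete, here is a minimal pure-Python sketch. It assumes raw counts for TF and the smoothed form log((N + 1) / (df + 1)) for IDF, which is the form described in Spark's MLlib documentation; other libraries use different variants. The function name `tf_idf` and the toy corpus are illustrative, not taken from the book.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed variant): a term found in every document gets 0.
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights

# Hypothetical toy corpus, already tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(corpus)
```

In this toy corpus, "the" appears in every document, so its weight drops to zero even though it is the most frequent token, while "mat", which appears in only one document, receives the largest weight in the first document.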
TF, or term frequency, is the number of times a term occurs in a document. In Spark, we can use the HashingTF transformer to compute term frequencies. HashingTF maps each term to an index with a hash function and creates, for each document, a sparse vector of indices and their frequencies. For example, if we consider the extraction of the feature using HashingTF extraction method text string, ...
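The hashing idea behind this kind of term-frequency vector can be sketched without Spark. The example below is an illustration under stated assumptions, not Spark's implementation: it uses `zlib.crc32` as a stand-in hash (Spark's HashingTF uses a different hash function), and the function name `hashing_tf` and its `num_features` parameter are hypothetical.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=1024):
    """Map tokens to a sparse {index: count} vector using the hashing trick."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash (assumption): any deterministic hash works for the sketch.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    # Only non-zero entries are stored, which is what makes the vector sparse.
    return dict(sorted(counts.items()))

# Hypothetical input document, tokenized on whitespace.
doc = "spark makes big data simple and spark is fast".split()
vector = hashing_tf(doc, num_features=1024)
spark_index = zlib.crc32(b"spark") % 1024
```

Because terms are mapped to indices by a hash rather than by a vocabulary lookup, no dictionary has to be built or stored. The trade-off is that two different terms can collide on the same index, which becomes more likely as `num_features` shrinks.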