TF-IDF
TF-IDF stands for term frequency-inverse document frequency, a measure of how important a word is to a document within a collection of documents. It is used extensively in information retrieval and reflects the weight of the word in the document. The TF-IDF value increases in proportion to the number of times the word occurs in the document, otherwise known as the frequency of the word/term. It consists of two key elements: the term frequency and the inverse document frequency.
TF is the term frequency, which is the frequency of a word/term in the document. For a term t and a document d, tf measures the number of times t occurs in d. Spark implements tf using hashing: each term is mapped to an index by applying a hash function, and the occurrences are counted per index.
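The hashing idea can be illustrated with a minimal plain-Python sketch. This is not Spark's implementation: the function name `hashing_tf`, the `num_features` parameter, and the use of CRC32 as the hash function are all assumptions chosen for illustration, and Spark's own hash function and default vector size may differ.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Count term occurrences per hashed bucket index (illustrative only)."""
    counts = Counter()
    for term in tokens:
        # Map the term to a bucket index in [0, num_features).
        # CRC32 is an assumed stand-in for whatever hash a real system uses.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# A tiny tokenized document; "spark" appears twice.
doc = "spark makes tf idf simple and spark scales".split()
vector = hashing_tf(doc)
print(vector)
```

Because different terms can hash to the same index, the counts of colliding terms are added together. A larger `num_features` makes such collisions less likely, at the cost of a longer vector.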
IDF is the ...