TF-IDF

The TF-IDF formula gives the relative importance of a term to a document within a corpus (a collection of documents). The weight of term i in document j is given by the following formula:

  w_{i,j} = tf_{i,j} × log(N / df_i)

Where:

  • tf_{i,j} = number of occurrences of term i in document j
  • df_i = number of documents containing term i
  • N = total number of documents
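The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: it assumes the common form tf_{i,j} × log(N / df_i), raw counts for tf, a base-10 logarithm, and documents that are already split into tokens. The function name tf_idf and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    corpus: list of documents, each a list of tokens (assumed pre-tokenized).
    Assumes raw counts for tf and a base-10 logarithm for idf.
    """
    n_docs = len(corpus)  # N = total number of documents

    # df[i] = number of documents containing term i
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        tf = Counter(doc)  # tf[i] = number of occurrences of i in this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical three-document corpus for illustration only.
docs = [
    ["the", "cat", "sat"],
    ["the", "rat", "ran"],
    ["the", "cat", "ate", "the", "rat"],
]
w = tf_idf(docs)
```

In this sketch a term that occurs in every document (here "the") gets weight 0, because log(N / df_i) = log(1) = 0, while a term found in only one document (such as "sat") gets the largest IDF.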
Consider a document that contains 1,000 words, in which the word rat appears 3 times. The term frequency (TF) for rat, here normalized by document length, is 3/1000 = 0.003. Now suppose that, in a corpus of 10,000 documents, the word rat appears in 1,000 of them. The inverse document frequency (IDF) is then log(10000/1000) = 1, using a base-10 logarithm. The TF-IDF weight is the product of these two quantities: 0.003 * 1 = 0.003.
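The arithmetic in this example can be checked with a short script. It is only a sketch under the example's own assumptions: TF is the term count divided by the document length, and the logarithm is base 10. The helper name tf_idf_weight is hypothetical.

```python
import math


def tf_idf_weight(term_count, doc_length, n_docs, docs_with_term):
    """TF-IDF for one term in one document (length-normalized TF, log base 10)."""
    tf = term_count / doc_length               # 3 / 1000 = 0.003
    idf = math.log10(n_docs / docs_with_term)  # log10(10000 / 1000) = 1
    return tf * idf


# Numbers from the example: 3 occurrences in a 1,000-word document,
# and 1,000 of 10,000 documents contain the term.
weight = tf_idf_weight(3, 1000, 10_000, 1_000)
print(round(weight, 6))  # 0.003
```

With these inputs the IDF factor is exactly 1, so the weight equals the term frequency, 0.003.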

The words or features in the ...

