Text Mining Fundamentals
Although rigorous approaches to natural language processing (NLP), such as sentence segmentation, tokenization, word chunking, and entity detection, are necessary to achieve the deepest possible understanding of textual data, it’s helpful to first introduce some fundamentals from Information Retrieval (IR) theory. The remainder of this chapter introduces some of the more foundational aspects of IR theory, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 8 provides a deeper discussion of NLP.
Note
If you want to dig deeper into IR theory, the full text of Introduction to Information Retrieval is available online and provides more information than you could ever want to know about the field.
A Whiz-Bang Introduction to TF-IDF
Information retrieval is an extensive field with many specialties. This discussion narrows in on TF-IDF, one of the most fundamental techniques for retrieving relevant documents from a corpus. TF-IDF stands for term frequency-inverse document frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents. Mathematically, TF-IDF is expressed as the product of the term frequency and the inverse document frequency, tf_idf = tf*idf, where the term tf represents the importance of a term in a specific document, and idf represents the importance of a term relative to the entire corpus. Multiplying these ...
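To make the product tf_idf = tf*idf concrete, here is a minimal Python sketch. It is not the book's implementation: the function names, the whitespace tokenization, the choice to normalize tf by document length, the plain log(N / df) form of idf, and the three-sentence toy corpus are all assumptions made for illustration. Real systems often use smoothed or otherwise adjusted variants of both factors.

```python
import math


def tf(term, doc):
    """Term frequency: the share of the document's tokens that equal the term.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


def idf(term, corpus):
    """Inverse document frequency: log(corpus size / documents containing the term).

    Assumption: a term that appears in no document scores 0.0 rather than
    raising a division error.
    """
    term = term.lower()
    n_containing = sum(1 for doc in corpus if term in doc.lower().split())
    if n_containing == 0:
        return 0.0
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc, corpus):
    """Score a term for one document as the product of tf and idf."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus, used only to exercise the functions.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

for term in ("the", "cat", "bird"):
    scores = [tf_idf(term, doc, corpus) for doc in corpus]
    print(term, scores)
```

In this sketch a word such as "the", which occurs in every document, gets an idf of log(1) = 0 and so contributes nothing to any score, while a word confined to a single document is weighted most heavily. That is the sense in which idf captures a term's importance relative to the whole corpus.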