6.7. Term Extraction

If you have done some word and phrase analysis on Web sites for better search engine placement, you will be familiar with the job that this transformation task performs. The Term Extraction transformation is a tool to mine free-flowing text for word and phrase frequency. You can feed any text-based input stream into the transformation and it will output two columns: a text phrase and a statistical value for the phrase relative to the total input stream. The statistical values or scores that can be calculated can be as simple as a count of the frequency of the words and phrases, or they can be a little more complicated as the result of a formula named TFIDF score. The TFIDF acronym stands for Term Frequency and Inverse Document Frequency, and it is a formula designed to balance the frequency of the distinct words and phrases relative to the total text sampled. If you're interested, here's the formula:

TDIDF (of a term or phrase) = (frequency of term) * log((# rows in sample)/(# rows
with term or phrase))

The results generated by the Term Extraction transformation are based on internal algorithms and statistical models that are encapsulated in the component. You can't alter or gain any insight into this logic by examining the code. However, some of the core rules for how the logic breaks apart the text to determine word and phrase boundaries are documented in Books Online. What you can do is tweak some external settings and make adjustments to the extraction ...

Get Professional SQL Server™ 2005 Integration Services now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.