Chapter 7. Google+: TF-IDF, Cosine Similarity, and Collocations
Initial printings of this book from February 2011 through February 2012 featured Google Buzz as the backdrop for data in this chapter. This chapter has been fully revised (with as few changes made as possible) to now feature Google+ instead. Example files have been updated and renamed with the plus__ prefix, but previous buzz__ example files are still available online with the other example code.
This short chapter begins our journey into text mining, and it’s something of an inflection point in this book. Earlier chapters have mostly focused on analyzing structured or semi-structured data such as records encoded as microformats, relationships among people, or specially marked #hashtags in tweets. This chapter, by contrast, begins munging and making sense of the textual information in documents by introducing Information Retrieval (IR) theory fundamentals such as TF-IDF, cosine similarity, and collocation detection. As you may have already inferred from the chapter title, Google+ initially serves as our primary source of data because it’s inherently social, easy to harvest, and has a lot of potential for the social web. Toward the end of this chapter, we’ll also look at what it takes to tap into your Gmail data. In the chapters ahead, we’ll investigate mining blog data and other sources of free text, and introduce additional forms of text analytics such as entity extraction and the automatic generation of abstracts. ...