Chapter 8. Blogs et al.: Natural Language Processing (and Beyond)

This chapter is a modest attempt to introduce Natural Language Processing (NLP) and apply it to the unstructured data in blogs. In the spirit of the prior chapters, it attempts to present the minimal level of detail required to empower you with a solid general understanding of an inherently complex topic, while also providing enough of a technical drill-down that you’ll be able to immediately get to work mining some data. Although we’ve been regularly cutting corners and taking a Pareto-like approach (giving you the crucial 20% of the skills that you can use to do 80% of the work), the corners we’ll cut in this chapter are especially pronounced because NLP is just that complex. No chapter out of any book (or any small multivolume set of books, for that matter) could possibly do it justice. This chapter is a pragmatic introduction that’ll give you enough information to do some pretty amazing things, like automatically generating abstracts from documents and extracting lists of important entities, but we will not journey very far into topics that would require multiple dissertations to sort out.

Although it’s not absolutely necessary that you have read Chapter 7 before you dive into this chapter, it’s highly recommended that you do so. A good understanding of Natural Language Processing presupposes an appreciation and working knowledge of some of the fundamental strengths and weaknesses of TF-IDF, vector space models, etc. ...
