Chapter 8. Building a text analysis toolkit

This chapter covers
  • A brief introduction to Lucene
  • Understanding tokenizers, TokenStream, and analyzers
  • Building an analyzer to detect phrases and inject synonyms
  • Use cases for leveraging the infrastructure

Most applications now rely on user-generated content (UGC). Users generate content in many ways: writing blog entries, sending messages to others, answering or posing questions on message boards, keeping journals, or creating lists of related items. In chapter 3, we looked at the use of tagging to represent metadata associated with content. We mentioned that tags can also be detected by an automated algorithm.

In this chapter, we build a toolkit to analyze ...
