5

LEXICONS, ENTITY EXTRACTION, AND GEOCODING

This book introduces many advanced topics in content analysis, ranging from sentiment analysis to document clustering, but in practice most projects rely on one of the simplest forms of analysis: the lookup file. Geocoding systems rely on gazetteers, massive databases of geographic locations and their latitude and longitude coordinates; lexicons use lists of keywords to assign text to predefined categories; and entity extraction systems combine context and word lists to compile the named entities appearing in a document. The theory behind such systems is simple: take each word or token and look for a match in the corresponding word list, but the complexities of ...
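The basic lookup idea described above (take each token and check it against a word list) can be illustrated with a minimal Python sketch. The lexicon, the gazetteer, and the function names below are hypothetical toy examples chosen for illustration and are not taken from the book; a real gazetteer would hold far more entries, and the coordinates shown are approximate.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: keyword -> category (illustrative only).
LEXICON = {
    "flood": "disaster",
    "earthquake": "disaster",
    "election": "politics",
    "vote": "politics",
}

# Hypothetical toy gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "paris": (48.86, 2.35),
    "cairo": (30.04, 31.24),
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text, lexicon=LEXICON):
    """Count how many tokens fall into each lexicon category."""
    return Counter(lexicon[t] for t in tokenize(text) if t in lexicon)


def geocode(text, gazetteer=GAZETTEER):
    """Return (place, coordinates) for each token found in the gazetteer."""
    return [(t, gazetteer[t]) for t in tokenize(text) if t in gazetteer]


if __name__ == "__main__":
    sample = "Flood warnings issued in Paris before the election vote"
    print(categorize(sample))
    print(geocode(sample))
```

This sketch matches single tokens only, so a multiword place name such as "New York" would not be found, and it makes no attempt to use context to tell apart places or keywords that share a spelling.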

