Segmentation, annotation, and chunking

When the text is presented in digital form, it is relatively easy to find words as we can split the stream on non-word characters. This becomes more complex in spoken language analysis. In this case, segmenters try to optimize a metric, for example, to minimize the number of distinct words in the lexicon and the length or complexity of the phrase (Natural Language Processing with Python by Steven Bird et al, O'Reilly Media Inc, 2009).

Annotation usually refers to parts-of-speech tagging. In English, these are nouns, pronouns, verbs, adjectives, adverbs, articles, prepositions, conjunctions, and interjections. For example, in the phrase we saw the yellow dog, we is a pronoun, saw is a verb, the is an article, ...

Get Mastering Scala Machine Learning now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.