July 2017
Beginner to intermediate
715 pages
17h 3m
English
Stop words are words that contribute little to the understanding or processing of data. Typical stop words include the, and, a, and or. When they add nothing to the analysis at hand, removing them simplifies processing and makes it more efficient.
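To make the idea concrete, here is a minimal sketch in plain Java that does not use any NLP library. The class name StopWordFilter, the method name, and the four-word stop list are assumptions chosen for illustration; real stop lists are much longer, and the simple ASCII-only tokenization here is not what a production tokenizer would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of LingPipe or the book's code.
class StopWordFilter {
    // Illustrative stop list only (assumption); real lists are far longer.
    private static final Set<String> STOP_WORDS = Set.of("the", "and", "a", "or");

    // Lowercases the text, splits it on runs of non-letter characters,
    // and keeps every token that is not in the stop list.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            // split() can yield an empty first token if the text starts with a delimiter
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [cat, dog, bird]
        System.out.println(removeStopWords("The cat and a dog or the bird"));
    }
}
```

The stop list is held in a Set so that each membership check is a constant-time lookup, which matters once the list grows beyond a handful of words.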
There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove them. We use the EnglishStopTokenizerFactory class, which wraps an IndoEuropeanTokenizerFactory instance and filters English stop words out of the tokens it produces:
public TweetHandler removeStopWords() {
    TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
    tokenizerFactory = ...