StopWordsRemover
Stop words are words that should be excluded from the input, typically because they appear frequently and carry little meaning. Spark's StopWordsRemover takes as input a sequence of strings, typically the output of Tokenizer or RegexTokenizer, and removes all the stop words from that sequence. The list of stop words is specified by the stopWords parameter. The current implementation of the StopWordsRemover API ships default stop-word lists for Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, Turkish, and English. To provide an example, we can simply extend the Tokenizer example from the previous section, since they are already ...
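To make the filtering step concrete, here is a minimal plain-Python sketch of the idea described above. It is not Spark's implementation and does not call the Spark API: the function name `remove_stop_words`, the tiny `STOP_WORDS` set, and the sample sentences are all hypothetical and chosen only for illustration. The case-insensitive default mirrors Spark's remover, whose caseSensitive parameter defaults to false.

```python
from typing import Iterable, List, Set

# Tiny illustrative stop-word list (an assumption for this sketch;
# Spark's built-in per-language lists are much longer).
STOP_WORDS: Set[str] = {"i", "a", "the", "is", "of", "and", "to"}


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: Set[str] = STOP_WORDS,
    case_sensitive: bool = False,
) -> List[str]:
    """Return the tokens that are not stop words, preserving their order."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    # Compare in lower case so that "The" and "the" are treated alike.
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


# Each row stands in for the output of a tokenizer: a sequence of strings.
tokenized_rows = [
    ["I", "saw", "the", "red", "balloon"],
    ["Mary", "had", "a", "little", "lamb"],
]

filtered_rows = [remove_stop_words(row) for row in tokenized_rows]
print(filtered_rows)
```

With this list, "I", "the", and "a" are dropped while the remaining tokens keep their original order and casing. Passing a different set as `stop_words` plays the role that the stopWords parameter plays in the text above.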