Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua

A note about stemming

A common step in text processing and tokenization is stemming: the conversion of whole words to a base form (called a word stem). For example, plurals might be converted to singular (dogs becomes dog), and forms such as walking and walker might become walk. Stemming can become quite complex and is typically handled with specialized NLP or search engine software (such as NLTK, OpenNLP, or Lucene). We will ignore stemming for the purpose of our example here.

A full treatment of stemming is beyond the scope of this book. You can find more details at http://en.wikipedia.org/wiki/Stemming.
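
To give a rough sense of what stemming does, the following is a minimal Python sketch that strips a few common English suffixes. It is not the approach used in this book (which ignores stemming) and not a real stemming algorithm such as the ones shipped with NLTK or Lucene; the function name naive_stem, the suffix list, and the minimum stem length are all assumptions made for illustration.

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip one of a few common suffixes.

    Illustration only. The suffix list and the minimum stem
    length of 3 are arbitrary assumptions, not a real algorithm.
    """
    w = word.lower()
    for suffix in ("ing", "er", "s"):
        # Only strip when at least 3 characters would remain,
        # so short words such as "is" or "sing" are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


tokens = ["dogs", "walking", "walker", "walk"]
stems = [naive_stem(t) for t in tokens]
print(stems)  # ['dog', 'walk', 'walk', 'walk']
```

Even this small sketch shows why stemming becomes complex: the same rule that maps walker to walk also truncates never to nev, which is why real stemmers rely on much more careful rule sets or dictionaries.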
