Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua
Applying basic tokenization

The first step in our text processing pipeline is to split the raw text content of each document into a collection of terms (also referred to as tokens). This is known as tokenization. We will start by applying simple whitespace tokenization and converting each token to lowercase for each document:

// Keep only the text content from each (file, text) pair
val text = rdd.map { case (file, text) => text }
// Split each document on whitespace and lowercase every token
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
println(whiteSpaceSplit.distinct.count)
In the preceding code, we used the flatMap function instead of map because, for now, we want to inspect all the tokens together for exploratory analysis. Later in this chapter, we will apply our tokenization scheme on a per-document basis, so we ...
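To make the flatMap versus map distinction concrete, here is a small, self-contained sketch (the sample documents and the TokenizationSketch object name are made up for illustration): flatMap yields one flat RDD of tokens across all documents, which suits exploratory analysis, while map keeps an array of tokens per document, which is the form a per-document pipeline needs.

import org.apache.spark.{SparkConf, SparkContext}

object TokenizationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tokenization-sketch").setMaster("local[2]"))

    // Stand-in for the (file, text) pairs used above; the contents are made up.
    val rdd = sc.parallelize(Seq(
      ("doc1.txt", "Spark makes distributed Processing simple"),
      ("doc2.txt", "Tokenization splits Text into terms")
    ))

    val text = rdd.map { case (file, content) => content }

    // flatMap: one flat RDD of all lowercase tokens across documents.
    val allTokens = text.flatMap(t => t.split(" ").map(_.toLowerCase))
    println(allTokens.distinct.count())

    // map: one array of lowercase tokens per document.
    val tokensPerDoc = text.map(t => t.split(" ").map(_.toLowerCase))
    tokensPerDoc.collect().foreach(tokens => println(tokens.mkString(", ")))

    sc.stop()
  }
}

Run locally, this prints the number of distinct lowercase tokens across both documents, followed by each document's token list on its own line.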
