O'Reilly logo

Mastering Machine Learning with Spark 2.x by Michal Malohlava, Max Pumperla, Alex Tellez

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Applying word2vec and exploring our data with vectors

Now that you have a good understanding of word2vec, doc2vec, and the incredible power of word vectors, it's time we turned our focus to our original IMDB dataset, whereby we will perform the following preprocessing:

  • Split words in each movie review by a space
  • Remove punctuation
  • Remove stopwords and all alphanumeric words
  • Using our tokenization function from the previous chapter, we will end with an array of comma-separated words
Because we have already covered the preceding steps in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, we'll quickly reproduce them in this section.

As usual, we begin with starting the Spark shell, which is our working environment:

export ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required