Applying word2vec and exploring our data with vectors

Now that you have a good understanding of word2vec, doc2vec, and the power of word vectors, it's time to return to our original IMDB dataset, on which we will perform the following preprocessing:

  • Split words in each movie review by a space
  • Remove punctuation
  • Remove stopwords and all alphanumeric words
  • Apply our tokenization function from the previous chapter, which leaves us with an array of words for each review
Because we have already covered the preceding steps in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, we'll quickly reproduce them in this section.
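
As a quick reminder of what these steps amount to, here is a minimal sketch in plain Scala that can be pasted into the Spark shell. It is not the tokenization function from Chapter 4: the name tokenize, the tiny stopwords set, and the sample sentence are hypothetical and chosen only for illustration. The sketch also lowercases the text so that stopword matching works, and it reads "alphanumeric words" as tokens that contain digits; both are assumptions on our part.

```scala
// Hypothetical, deliberately tiny stopword list (illustration only).
val stopwords: Set[String] =
  Set("a", "an", "the", "and", "of", "is", "it", "was", "this")

// Sketch of the preprocessing steps listed above:
// split on whitespace, strip punctuation, drop stopwords and
// tokens containing digits, and return an array of words.
def tokenize(review: String): Array[String] =
  review
    .toLowerCase                                  // assumption: case-insensitive matching
    .split("\\s+")                                // split words by whitespace
    .map(_.replaceAll("\\p{Punct}", ""))          // remove punctuation characters
    .filter(_.nonEmpty)                           // discard tokens that became empty
    .filterNot(token => token.exists(_.isDigit))  // drop tokens that contain digits
    .filterNot(stopwords.contains)                // drop stopwords

val tokens = tokenize("This movie was great, and the 2nd act is a blast!")
println(tokens.mkString(","))  // prints: movie,great,act,blast
```

In a Spark job, a function like this would typically be applied to every review, for example by mapping it over an RDD or wrapping it in a UDF for a DataFrame column.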

As usual, we begin by starting the Spark shell, which is our working environment:

export ...
