Now that you have a good understanding of word2vec, doc2vec, and the power of word vectors, it's time to return to our original IMDB dataset, on which we will perform the following preprocessing:
- Split words in each movie review by a space
- Remove punctuation
- Remove stopwords and all non-alphabetic words (for example, tokens that contain digits)
- Apply our tokenization function from the previous chapter, so that each review ends up as an array of words
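
The steps above can be sketched in plain Scala, without any Spark dependency, to make the intent concrete. This is only an illustration, not the tokenization function from the previous chapter: the `ReviewTokenizer` object, its `tokenize` method, and the tiny `stopWords` set are hypothetical names and values chosen for this example. It also assumes that the stopword step drops any token that is not purely alphabetic.

```scala
// Hypothetical stand-in for the tokenization function from the previous chapter.
object ReviewTokenizer {

  // A tiny illustrative stopword list (an assumption); a real list is much longer.
  val stopWords: Set[String] =
    Set("the", "a", "an", "and", "is", "it", "this", "of", "was")

  def tokenize(review: String, stopWords: Set[String]): Array[String] =
    review
      .split(" ")                                       // split the review on a space
      .map(_.toLowerCase.replaceAll("\\p{Punct}", ""))  // lowercase and strip punctuation
      .filter(_.nonEmpty)                               // drop empties left by repeated spaces
      .filter(_.forall(_.isLetter))                     // assumption: keep purely alphabetic tokens
      .filter(w => !stopWords.contains(w))              // remove stopwords
}

val tokens = ReviewTokenizer.tokenize(
  "This movie was GREAT, and it is a 10outof10 classic!",
  ReviewTokenizer.stopWords)

println(tokens.mkString(", "))  // prints: movie, great, classic
```

Wrapping the logic in an object keeps the snippet easy to paste into the Spark shell; in a Spark job the same function could be applied to each review in the dataset.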
As usual, we begin by starting the Spark shell, which is our working environment:
export ...