July 2017
Intermediate to advanced
796 pages
18h 55m
English
CountVectorizer converts a collection of text documents into vectors of token counts, producing a sparse representation of each document over the vocabulary. The result is a feature vector that can be passed to other algorithms. Later on, we will see how to use the output of CountVectorizer in the LDA algorithm to perform topic detection.
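To make the idea of a sparse count vector over a vocabulary concrete, here is a minimal plain-Scala sketch. It is not Spark's implementation: the names CountVectorSketch, SparseCounts, buildVocabulary, and vectorize are hypothetical, and ordering the vocabulary by descending corpus frequency (ties broken alphabetically) is an assumption made for the illustration.

```scala
// Conceptual sketch only: this is NOT Spark's CountVectorizer implementation.
// All names here are hypothetical and chosen for illustration.
object CountVectorSketch {

  // A sparse vector: the vocabulary size plus (term index -> count)
  // for the terms that actually occur in the document.
  final case class SparseCounts(size: Int, counts: Map[Int, Int])

  // Build a vocabulary from all documents. Assumption: terms are ordered
  // by descending corpus frequency, with ties broken alphabetically.
  def buildVocabulary(docs: Seq[Seq[String]]): Vector[String] =
    docs.flatten
      .groupBy(identity)
      .map { case (term, occurrences) => (term, occurrences.size) }
      .toVector
      .sortBy { case (term, count) => (-count, term) }
      .map(_._1)

  // Count how often each vocabulary term occurs in one document.
  // Tokens that are not in the vocabulary are ignored.
  def vectorize(doc: Seq[String], vocabulary: Vector[String]): SparseCounts = {
    val index = vocabulary.zipWithIndex.toMap
    val counts = doc
      .flatMap(index.get)
      .groupBy(identity)
      .map { case (i, hits) => (i, hits.size) }
    SparseCounts(vocabulary.size, counts)
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("spark", "ml", "spark"),
      Seq("topic", "model", "spark")
    )
    val vocabulary = buildVocabulary(docs)
    println(vocabulary)
    docs.map(vectorize(_, vocabulary)).foreach(println)
  }
}
```

Only the non-zero counts are stored, which is why the representation stays compact even when the vocabulary is much larger than any single document.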
To use CountVectorizer, you need to import the class:
import org.apache.spark.ml.feature.CountVectorizer
First, you need to initialize a CountVectorizer, specifying the input column and the output column. Here, we choose the filteredWords column created by the StopWordsRemover as the input and name the output column features:
scala> val ...