July 2017
Intermediate to advanced
796 pages
18h 55m
English
Tokenizer converts the input string into lowercase and then splits the string with whitespaces into individual tokens. A given sentence is split into words either using the default space delimiter or using a customer regular expression based Tokenizer. In either case, the input column is transformed into an output column. In particular, the input column is usually a String and the output column is a Sequence of Words.
Tokenizers are available by importing two packages shown next, the Tokenizer and the RegexTokenize:
import org.apache.spark.ml.feature.Tokenizerimport org.apache.spark.ml.feature.RegexTokenizer
First, you need to initialize a Tokenizer specifying the input column and the output column:
scala> val tokenizer = new ...
Read now
Unlock full access