Apache Spark 2.x for Java Developers by Sumit Kumar, Sourav Gulati

Feature extractors

When the features in a raw DataFrame are not already in the form an ML algorithm expects, we use feature extractors to derive them. Common feature extractors are:

  • CountVectorizer: A CountVectorizer converts a collection of text documents into vectors of word counts, one vector per document. It works in two different ways, depending on how the dictionary (vocabulary) gets populated. Let's first assume that the user has no prior information about the kind of data that will populate the text dataset; in such a scenario the dictionary is built from the term frequencies observed across the dataset. vocabSize defines the number of words a dictionary can hold while the optional ...
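To make the dictionary-building idea concrete, here is a minimal plain-Java sketch of the mechanism described above: count terms across all documents, keep only the most frequent ones up to a size limit, then represent each document as a vector of counts. This is an illustration, not Spark's implementation. The class `CountVectorSketch`, its method names, and the alphabetical tie-breaking rule are assumptions made for this example; in Spark itself the corresponding limit is set on `CountVectorizer` through `setVocabSize`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain-Java imitation of the idea, not Spark's code.
// Class and method names are hypothetical.
class CountVectorSketch {

    // Build a dictionary of at most vocabSize terms, keeping the terms that
    // occur most often across all documents. Ties are broken alphabetically
    // (an assumption made here so the result is deterministic).
    static List<String> buildVocabulary(List<List<String>> docs, int vocabSize) {
        Map<String, Integer> totals = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                totals.merge(term, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(totals.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(totals.get(b), totals.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(vocabSize, terms.size())));
    }

    // Turn one document into a count vector indexed by dictionary position.
    // Terms that did not make it into the dictionary are ignored.
    static int[] toCountVector(List<String> doc, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String term : doc) {
            int index = vocabulary.indexOf(term);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("a", "b", "b", "c", "a"));

        // With a limit of 2, the least frequent term "c" is dropped.
        List<String> vocabulary = buildVocabulary(docs, 2);
        System.out.println(vocabulary);
        for (List<String> doc : docs) {
            System.out.println(Arrays.toString(toCountVector(doc, vocabulary)));
        }
    }
}
```

In this toy dataset "a" and "b" each occur three times and "c" twice, so a dictionary limited to two words keeps "a" and "b", and every occurrence of "c" is left out of the resulting vectors.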
