Feature extractors

When the data present in a raw dataframe are not explicitly present in the form an ML algorithm expects we use feature extractors to extract those features. Common feature extractors are:

  • CountVectorizer: A CountVectorizer converts a collection of text documents into a vector representing the word count of text documents. CountVectorizer works in two different ways depending how the value of the dictionary gets populated. Let's first assume that the user has no prior information of the type of data that will populate the dataset of text; in such a scenario the dictionary gets prepared based on occurrence of term frequency across the dataset. VocabSize defines the number of words a dictionary can hold while the optional ...

Get Apache Spark 2.x for Java Developers now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.