December 2018 · Beginner to intermediate · 684 pages · 21h 9m · English
In this chapter, we explored numerous techniques for processing unstructured text data with the goal of extracting semantically meaningful numerical features for use in machine learning models.
We covered the basic tokenization and annotation pipeline and illustrated its implementation for multiple languages using spaCy and TextBlob. Building on these results, we used the bag-of-words model to represent documents as numerical vectors. We learned how to refine the preprocessing pipeline and then applied the vectorized text data to classification and sentiment analysis tasks.
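As a minimal sketch of the bag-of-words idea described above (using a naive tokenizer and invented toy sentences rather than the chapter's actual spaCy/TextBlob pipeline), each document is mapped to a count vector over a shared vocabulary:

```python
from collections import Counter

def tokenize(text):
    # Naive lowercase/whitespace tokenizer; a real pipeline (e.g. spaCy)
    # would also handle punctuation, lemmatization, and stop words.
    return text.lower().replace(".", "").split()

def bag_of_words(docs):
    # Build a shared vocabulary across all documents,
    # then map each document to a vector of term counts.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts.get(term, 0) for term in vocab])
    return vocab, vectors

docs = ["The market rallied.", "The market fell."]
vocab, vectors = bag_of_words(docs)
# vocab   -> ['fell', 'market', 'rallied', 'the']
# vectors -> [[0, 1, 1, 1], [1, 1, 0, 1]]
```

The resulting count vectors are the kind of numerical representation that downstream classifiers and sentiment models consume; in practice one would use a library implementation such as scikit-learn's CountVectorizer rather than this hand-rolled version.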
In the remaining two chapters on alternative text data, we will learn how to summarize text using unsupervised learning to identify latent topics ...