July 2017
Intermediate to advanced
360 pages
8h 26m
English
Natural language processing is a set of machine learning techniques that allows working with text documents, considering their internal structure and the distribution of words. In this chapter, we're going to discuss all common methods to collect texts, split them into atoms, and transform them into numerical vectors. In particular, we'll compare different methods to tokenize documents (separate each word), to filter them, to apply special transformations to avoid inflected or conjugated forms, and finally to build a common vocabulary. Using the vocabulary, it will be possible to apply different vectorization approaches to build feature vectors that can easily be used for classification or clustering ...
Read now
Unlock full access