Preparing the data

Preparing text is a task in its own right. Real-world text is often messy, and unlike numeric data it cannot be fixed with a few simple scaling operations. For instance, people make typos, add unnecessary characters, or use text encodings that we cannot read. NLP therefore comes with its own set of data cleaning challenges and techniques.

Sanitizing characters

To store text, computers need to encode the characters into bits. There are several different ways to do this, and not all of them can deal with all the characters out there.
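As a minimal sketch of this point, the following Python snippet checks which encodings can represent a given string and shows what happens when bytes are decoded with the wrong scheme. The sample string and the helper name `can_encode` are assumptions made for illustration; only the built-in `str.encode` and `bytes.decode` methods are used.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample containing characters outside plain ASCII.
sample = "Zürich: 100 €"

# Try a few common encodings; not all of them cover every character.
results = {enc: can_encode(sample, enc) for enc in ("ascii", "latin-1", "utf-8")}
print(results)

# UTF-8 uses a variable number of bytes per character,
# so the byte count can exceed the character count.
utf8_bytes = sample.encode("utf-8")
print(len(sample), "characters ->", len(utf8_bytes), "bytes")

# Decoding bytes with the wrong scheme does not fail here,
# but it silently produces garbled text.
garbled = "ü".encode("utf-8").decode("latin-1")
print(garbled)
```

In this sketch only UTF-8 can encode the whole sample: ASCII has no "ü", and Latin-1 has no euro sign. The last step illustrates why unreadable characters appear when a file's actual encoding differs from the one assumed by the reader.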

It is good practice to keep all the text files in one encoding scheme, usually UTF-8, but of course, that does not always happen. Files might also be corrupted, meaning that a few ...
