December 2018
Beginner to intermediate
684 pages
21h 9m
English
A key goal of applying ML to text data for algorithmic trading is to extract signals from documents. A document is an individual sample from a relevant text data source, such as a company report, a headline or news article, or a tweet. A corpus (plural: corpora), in turn, is a collection of documents.
The following diagram lays out the key steps to convert documents into a dataset that can be used to train a supervised ML algorithm capable of making actionable predictions:

Fundamental techniques extract text features as semantic units called tokens, and use linguistic rules and dictionaries to enrich these tokens with linguistic ...
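To make the terms document, corpus, and token concrete, here is a minimal sketch that turns a toy corpus into per-document token counts, one simple form of dataset a supervised model could consume. The two headlines, the regular-expression tokenizer, and the helper names (`tokenize`, `build_vocabulary`, `to_count_matrix`) are illustrative assumptions, not the approach taken in the text; a real pipeline would typically rely on an NLP library and would also add the linguistic annotations mentioned above.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each string is one document (here, a headline).
corpus = [
    "Company A beats earnings estimates",
    "Company B misses earnings estimates, shares fall",
]

# Assumption: a token is any run of lowercase letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(document):
    """Lowercase a document and split it into alphanumeric tokens."""
    return TOKEN_PATTERN.findall(document.lower())


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_matrix(tokenized_docs, vocabulary):
    """Return one row of token counts per document, aligned to the vocabulary."""
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


tokenized = [tokenize(doc) for doc in corpus]
vocabulary = build_vocabulary(tokenized)
features = to_count_matrix(tokenized, vocabulary)

for doc, row in zip(corpus, features):
    print(doc, row)
```

Each row of `features` describes one document and each column one token, so the result can be paired with a label per document (for example, a subsequent return) to form a training set.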