Summary
In this chapter, we implemented a complete ML pipeline: collecting historical data, transforming it into a format suitable for testing hypotheses, training ML models, and running predictions on live data, with the option to evaluate many different models and select the best one.
The test results showed that, as in the original dataset, about 600,000 of the 2.4 million minutes (roughly 25%) can be classified as increasing price (the close price was higher than the open price), so the dataset can be considered imbalanced. Although random forests usually perform well on imbalanced datasets, an area under the ROC curve of 0.74 is not ideal. As we need to have fewer false positives (fewer times when we trigger a purchase and the price drops), ...
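The imbalance and the false-positive concern above can be made concrete with a small sketch. This is not the chapter's implementation: the function names, the toy labels and scores, and the two thresholds are all hypothetical, chosen only to show how the share of positive minutes is computed and how raising the decision threshold of a probabilistic classifier can reduce false positives, usually at the cost of missing some real price increases.

```python
def positive_share(n_positive, n_total):
    """Fraction of samples that belong to the positive class."""
    return n_positive / n_total


def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_up = score >= threshold
        if predicted_up and label == 1:
            tp += 1  # purchase triggered, price rose
        elif predicted_up and label == 0:
            fp += 1  # purchase triggered, price dropped
        elif label == 1:
            fn += 1  # missed a price increase
        else:
            tn += 1  # correctly stayed out
    return tp, fp, tn, fn


def precision(tp, fp):
    """Share of triggered purchases that were followed by a price increase."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Class balance reported in the summary: about 600,000 of 2.4 million minutes.
share = positive_share(600_000, 2_400_000)

# Hypothetical toy data: 1 = price increased, scores = model probabilities.
labels = [1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.2, 0.7, 0.55, 0.1, 0.4, 0.3]

default_counts = confusion_counts(labels, scores, threshold=0.5)
strict_counts = confusion_counts(labels, scores, threshold=0.65)

print(f"positive share: {share:.2f}")
print("threshold 0.50:", default_counts, "precision", precision(*default_counts[:2]))
print("threshold 0.65:", strict_counts, "precision", precision(*strict_counts[:2]))
```

On this toy data, the stricter threshold removes both false positives while the one missed increase stays missed. On real data the trade-off is rarely this clean, so the threshold is best chosen on a validation set.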