Chapter 10. Improving Models and Data Extraction

Sometimes, no matter how good an algorithm is, it just doesn’t work. Or worse, it doesn’t pick up anything. Data can be quite noisy, and sometimes it’s nearly impossible to figure out what went wrong. This chapter focuses on improving what you already have, either by selecting better features or by transforming your features into a new set. We do this by monitoring metrics from either cross-validation or production monitoring.

This chapter will be something of a smorgasbord when it comes to improving your models, because there are many ways of fixing them.

The Problem with the Curse of Dimensionality

As we’ve talked about before, the curse of dimensionality is a big problem for distance-based machine learning algorithms. Generally speaking, as the number of dimensions increases, the average distance between data points also goes up. Take, for instance, the case in Figure 10-1, where we see a perfect sphere centered at (0, 0, 0).

Figure 10-1. In three dimensions, the average distance from the center is 1 because the sphere is perfect
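
The general claim that average distance grows with the number of dimensions can be checked numerically. The following is a minimal sketch, not taken from the chapter: it assumes points drawn uniformly at random from the unit hypercube and measures Euclidean distance between every pair. The function name mean_pairwise_distance, the sample size, and the list of dimensions are all illustrative choices.

```python
import math
import random


def mean_pairwise_distance(dims, n_points=200, seed=0):
    """Average Euclidean distance between random points in the unit hypercube.

    Assumption for illustration: points are uniform on [0, 1] in each dimension.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    total = 0.0
    pairs = 0
    # Visit every unordered pair of points exactly once.
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


dimensions = [1, 2, 3, 10, 100]
averages = {d: mean_pairwise_distance(d) for d in dimensions}
for d in dimensions:
    print(f"{d:>3} dimensions: average distance = {averages[d]:.3f}")
```

Under these assumptions the printed averages should rise steadily as dimensions are added, which is why distance-based algorithms find it harder to tell near neighbors from far ones in high-dimensional data.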

Everything is fine in three dimensions, but what if we project only onto two dimensions? What ends up happening is quite illuminating (see Figure 10-2).

Figure 10-2. In this case of dimension = 2, the average distance ...
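
The effect of projecting from three dimensions down to two can also be sketched in code. This is an illustration under stated assumptions, not a reproduction of the figures: it assumes points spread uniformly over the surface of a unit sphere (sampled by normalizing Gaussian draws) and a projection that simply drops the z coordinate. The helper name sample_unit_sphere and the sample size are hypothetical.

```python
import math
import random


def sample_unit_sphere(n_points, seed=0):
    """Sample points uniformly on the surface of the unit sphere in 3D.

    Normalizing a vector of independent Gaussian draws gives a direction
    that is uniform over the sphere.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        points.append((x / norm, y / norm, z / norm))
    return points


points = sample_unit_sphere(20000)

# In three dimensions every point sits exactly one unit from the center.
avg_3d = sum(math.hypot(x, y, z) for x, y, z in points) / len(points)

# Projecting onto the xy-plane (dropping z) pulls most points toward the center.
avg_2d = sum(math.hypot(x, y) for x, y, _ in points) / len(points)

print(f"average distance from center in 3D: {avg_3d:.3f}")
print(f"average distance from center in 2D: {avg_2d:.3f}")
```

For uniformly distributed points, the projected average should land near π/4 (about 0.785), noticeably below the three-dimensional value of 1. The same data looks more tightly packed once a dimension is removed.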
