Chapter 9. Modeling Data

In this chapter we’re going to perform the fourth step of the OSEMN model: modeling data. Generally speaking, a model is an abstract or higher-level description of your data. Modeling is a bit like creating visualizations in the sense that we’re taking a step back from the individual data points to see the bigger picture.

Visualizations are characterized by shapes, positions, and colors: we can interpret them by looking at them. Models, on the other hand, are internally characterized by numbers, which means that computers can use them to do things like make predictions about new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)
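To make "characterized by numbers" concrete, here is a minimal sketch in Python, assuming the simplest possible model: a straight line fitted by ordinary least squares. The data values and the names `fit_line` and `predict` are made up for illustration and are not taken from this chapter.

```python
# A model can be as small as two numbers: a slope and an intercept.
# The data below is made up purely for illustration.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Use the stored numbers to predict a value for a new data point."""
    slope, intercept = model
    return slope * x + intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)      # the whole model is just this pair of numbers
new_prediction = predict(model, 5.0)
print("model:", model)
print("prediction for x=5:", new_prediction)
```

Once the two numbers are stored, the original data points are no longer needed to make a prediction, which is what separates a model from a picture of the data.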

In this chapter I’ll consider three types of algorithms commonly used to model data:

  • Dimensionality reduction

  • Regression

  • Classification

These algorithms come from the fields of statistics and machine learning, so I’m going to change the vocabulary a bit. Let’s assume that I have a CSV file, which I’ll refer to as a dataset. Each row, except for the header, is considered a data point. Each data point has one or more features, that is, properties that have been measured. Sometimes a data point also has a label, which is, generally speaking, a judgment or outcome. This becomes more concrete when I introduce the wine dataset later in this chapter.
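The vocabulary above can be sketched in a few lines of Python. This is only an illustration: the column names (`alcohol`, `ph`, `quality`), the values, and the helper `split_dataset` are hypothetical, and I'm assuming that the label lives in a single, named column.

```python
import csv
import io

# Hypothetical CSV contents; column names and values are made up for illustration.
RAW = """alcohol,ph,quality
9.4,3.51,5
9.8,3.20,5
11.2,3.16,6
"""

def split_dataset(text, label_column):
    """Split CSV text into feature names, feature rows, and labels."""
    reader = csv.DictReader(io.StringIO(text))  # the header row supplies the column names
    feature_names = [name for name in reader.fieldnames if name != label_column]
    features, labels = [], []
    for row in reader:  # every non-header row is one data point
        features.append([float(row[name]) for name in feature_names])
        labels.append(row[label_column])
    return feature_names, features, labels

feature_names, features, labels = split_dataset(RAW, "quality")
print(len(features), "data points")
print("features:", feature_names)
print("labels:", labels)
```

A dataset without a label column would simply have an empty list of labels, with every column treated as a feature.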

The first type of algorithm (dimensionality reduction) is most often unsupervised, which means that it creates a model based on the ...
