Data Science at the Command Line, 2nd Edition

by Jeroen Janssens
August 2021
Content level: Beginner to intermediate
280 pages
6h 12m
English
O'Reilly Media, Inc.

Chapter 9. Modeling Data

In this chapter we’re going to perform the fourth step of the OSEMN model: modeling data. Generally speaking, a model is an abstract or higher-level description of your data. Modeling is a bit like creating visualizations in the sense that we’re taking a step back from the individual data points to see the bigger picture.

Visualizations are characterized by shapes, positions, and colors: we can interpret them by looking at them. Models, on the other hand, are internally characterized by numbers, which means that computers can use them to do things like make predictions about new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)
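To make "characterized by numbers" concrete, here is a minimal sketch in Python, assuming a straight-line model fitted by ordinary least squares. The function names `fit_line` and `predict` and the toy data are hypothetical and chosen only for illustration; they are not taken from the book. The point is that the whole model reduces to two numbers (a slope and an intercept), which a computer can then apply to a data point it has not seen before.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the model (just two numbers) to a new data point."""
    slope, intercept = model
    return slope * x + intercept


# Made-up toy data, not from any real dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print("model:", model)
print("prediction for x = 5:", predict(model, 5.0))
```

Real models usually have many more numbers than two, but the idea is the same: once the numbers are known, predicting is just arithmetic.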

In this chapter I’ll consider three types of algorithms commonly used to model data:

  • Dimensionality reduction

  • Regression

  • Classification

These algorithms come from the fields of statistics and machine learning, so I’m going to change the vocabulary a bit. Let’s assume that I have a CSV file, also known as a dataset. Each row, except for the header, is considered a data point. Each data point has one or more features, or properties that have been measured. Sometimes a data point also has a label, which is, generally speaking, a judgment or outcome. This becomes more concrete when I introduce the wine dataset later in this chapter.
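The vocabulary above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the column names (`alcohol`, `ph`, `quality`), the values, and the helper `split_dataset` are all hypothetical, not the actual wine dataset used later in the chapter. Each row after the header becomes one data point, the measured columns become its features, and one chosen column is treated as the label.

```python
import csv
import io

# Hypothetical CSV content; column names and values are made up.
CSV_TEXT = """alcohol,ph,quality
9.4,3.51,5
9.8,3.20,6
10.5,3.35,7
"""


def split_dataset(csv_text, label_column):
    """Return (features, labels): one feature dict and one label per data point."""
    reader = csv.DictReader(io.StringIO(csv_text))  # header row gives the names
    features, labels = [], []
    for row in reader:  # each remaining row is one data point
        labels.append(row[label_column])
        features.append(
            {name: float(value) for name, value in row.items() if name != label_column}
        )
    return features, labels


features, labels = split_dataset(CSV_TEXT, "quality")
print(len(features), "data points")
print("features of first data point:", features[0])
print("labels:", labels)
```

A dataset without a label column would simply skip the split, leaving only features.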

The first type of algorithm (dimensionality reduction) is most often unsupervised, which means that it creates a model based on the ...


ISBN: 9781492087908