Data Science at the Command Line

by Jeroen Janssens
October 2014
Beginner to intermediate
210 pages
4h 32m
English
O'Reilly Media, Inc.
Content preview from Data Science at the Command Line

Chapter 9. Modeling Data

In this chapter, we’ll perform the fourth step of the OSEMN model (and the last step to require a computer): modeling data. Generally speaking, to model data is to create an abstract or higher-level description of your data. As with creating visualizations, modeling means taking a step back from the individual data points.

Visualizations, on the one hand, are characterized by shapes, positions, and colors such that we can interpret them by looking at them. Models, on the other hand, are internally characterized by a bunch of numbers, which means that computers can use them, for example, to make predictions about new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)
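To make the idea of a model as "a bunch of numbers" concrete, here is a minimal sketch in Python. It is not taken from the book: the data points are made up for illustration, and the function names `fit_line` and `predict` are hypothetical. It fits a straight line by ordinary least squares, so the entire model is just two numbers (a slope and an intercept) that can then be used to predict a new data point.

```python
# Illustrative, made-up data points (not from the book); they lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the two-number model to a new input value."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(xs, ys)   # the whole model is the pair (slope, intercept)
print(model)               # (2.0, 1.0)
print(predict(model, 5.0)) # 11.0
```

Once fitted, the original data points are no longer needed to make a prediction; the two numbers are enough.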

In this chapter, we’ll consider four common types of algorithms to model data:

  • Dimensionality reduction

  • Clustering

  • Regression

  • Classification

These four types of algorithms come from the field of machine learning. As such, we’re going to change our vocabulary a bit. Let’s assume that we have a CSV file, also known as a data set. Each row, except for the header, is considered to be a data point. For simplicity, we assume that each column that contains numerical values is an input feature. If a data point also contains a nonnumerical field, such as the species column in the Iris data set, then that field is known as the data point’s label.
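The vocabulary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-row sample is written in the style of the Iris data set, the column names are assumed, and the helper names `is_number` and `split_features_and_labels` are hypothetical. Every row after the header becomes a data point, numerical fields become its input features, and the first nonnumerical field becomes its label.

```python
import csv
import io

# Small sample in the style of the Iris data set; the column names are assumed.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def is_number(text):
    """Return True if the field can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_features_and_labels(csv_text):
    """Treat each non-header row as a data point.

    Numerical fields become input features; the first nonnumerical
    field, if any, becomes the data point's label.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        features.append({k: float(v) for k, v in row.items() if is_number(v)})
        nonnumeric = [v for v in row.values() if not is_number(v)]
        labels.append(nonnumeric[0] if nonnumeric else None)
    return features, labels


features, labels = split_features_and_labels(SAMPLE_CSV)
print(len(features))  # number of data points
print(features[0])    # input features of the first data point
print(labels)         # one label per data point
```

Deciding per field whether a value is numerical mirrors the simplifying assumption in the text; a real data set may need explicit column types instead.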

The first two types of algorithms (dimensionality reduction and clustering) are most often ...



Publisher Resources

ISBN: 9781491947845