Chapter 5: Create a Modeling Data Set

Overview

ETL

Extract

APIs

Web Scraping

Open Source Data Sets

Data Set

Get the Data

Reduce the Size of the Data

Create a Target Variable

Creating TRAIN and TEST Data Sets

Variable Selection

Transform

Load

Chapter Review

Overview

It is commonly estimated that at least 80% of a data scientist’s effort goes into the extract, transform, load (ETL) stage of model development. This stage is often overlooked because it is not as exciting as applying a range of awesome algorithms to your data and evaluating your model’s performance. Yet the ETL process is critical for quality model development because of the GIGO rule: Garbage In, Garbage Out. A model can only be as good as the data it is built on.

Nearly all data sets need to ...

Get End-to-End Data Science with SAS now with the O’Reilly learning platform.
