Data Science Using Python and R

by Chantal D. Larose, Daniel T. Larose
April 2019
Content level: Beginner to intermediate
240 pages
6h 47m
English
Wiley

Chapter 5 PREPARING TO MODEL THE DATA

5.1 THE STORY SO FAR

To recapitulate our progress thus far, we are working our way through the Data Science Methodology.

  1. In Chapter 3, we discussed the importance of the Problem Understanding Phase.
  2. Also in Chapter 3, we dealt with several issues regarding the Data Preparation Phase.
  3. In Chapter 4, we covered some important topics in the Exploratory Data Analysis Phase.
  4. Now, here in Chapter 5, we are ready to tackle the Setup Phase.

The Setup Phase consists of a number of very important tasks that must be completed before we can begin our data modeling. These include:

  • Partitioning the data
  • Validating the data partition
  • Balancing the data
  • Establishing baseline model performance

We cover each of these topics in turn in this chapter.
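To make two of these tasks concrete, the following is a minimal sketch of partitioning a data set and establishing a baseline, using only the Python standard library. It is not the authors' code: the record layout, the `response` field, the 25% test fraction, and the helper names `partition` and `baseline_accuracy` are all assumptions made for illustration. The baseline here is taken to be the accuracy of always predicting the most common class in the training set.

```python
import random
from collections import Counter


def partition(records, test_fraction=0.25, seed=7):
    """Randomly split records into (training, test) lists.

    The seed is fixed only so the illustration is repeatable.
    """
    rng = random.Random(seed)
    shuffled = records[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def baseline_accuracy(train, test, target="response"):
    """Accuracy on the test set of always predicting the majority training class."""
    majority = Counter(r[target] for r in train).most_common(1)[0][0]
    hits = sum(1 for r in test if r[target] == majority)
    return majority, hits / len(test)


# Hypothetical data: 1000 records, of which 20% have response "yes".
records = [{"id": i, "response": "yes" if i % 5 == 0 else "no"}
           for i in range(1000)]

train, test = partition(records)
majority, acc = baseline_accuracy(train, test)
print(len(train), len(test), majority, round(acc, 3))
```

Any model built later would need to beat this baseline accuracy on the test set to be of interest.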

5.2 PARTITIONING THE DATA

The Data Science Methodology does not use the statistical inference paradigm where generalization is made from a sample to a population. There are two reasons for this.

  1. Applying statistical inference to the huge sample sizes encountered in data science tends to result in statistical significance, even when the results are not of practical significance.
  2. In the statistical paradigm, the statistician has an a priori hypothesis in mind, whereas the Data Science Methodology requires no such a priori hypothesis, instead freely searching through the data for actionable results.
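The first reason can be illustrated with a small calculation. The sketch below assumes a two-sample z-test for a difference in means with a known, common standard deviation; the function name `two_sided_p_value` and the numbers used (a difference of 0.01 on a scale where the standard deviation is 1) are hypothetical and chosen only for illustration. The same tiny difference is nowhere near significant with 100 records per group, yet highly significant with one million per group.

```python
import math


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test with equal group sizes.

    Assumes both groups share the same known standard deviation `sd`.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))


tiny_diff = 0.01   # hypothetical effect, 1% of a standard deviation
p_small_n = two_sided_p_value(tiny_diff, sd=1.0, n_per_group=100)
p_huge_n = two_sided_p_value(tiny_diff, sd=1.0, n_per_group=1_000_000)

print(f"n = 100 per group:       p = {p_small_n:.3g}")
print(f"n = 1,000,000 per group: p = {p_huge_n:.3g}")
```

The effect size is identical in both calls; only the sample size changes, which is why statistical significance alone says little about practical significance at data science scale.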

Because of the lack of a priori hypotheses, data scientists need to beware of data dredging, whereby phantom ...


ISBN: 9781119526810