Data Science Using Python and R

by Chantal D. Larose, Daniel T. Larose
April 2019
Content level: Beginner to intermediate
240 pages
6h 47m
English
Wiley

Chapter 12 DIMENSION REDUCTION

12.1 THE NEED FOR DIMENSION REDUCTION

High dimensionality in data science refers to a data set with a large number of predictors. For example, 100 predictors describe a 100‐dimensional space. So why do we need dimension reduction in data science?

  1. Multicollinearity. Typically, large databases have many predictors. It is unlikely that all of these predictors are uncorrelated. Multicollinearity, which occurs when there is substantial correlation among the predictors, can lead to unstable regression models.
  2. Double‐Counting. Inclusion of predictors which are highly correlated tends to overemphasize a particular aspect of the model, that is, essentially double‐counting this aspect. For example, suppose we are trying to estimate the age of youngsters using math knowledge, height, and weight. Since height and weight are correlated, the model is essentially double‐counting the physical component of the youngster, as compared to the intellectual component.
  3. Curse of Dimensionality. As dimensionality increases, the volume of the predictor space grows exponentially, that is, faster than the number of predictors itself. Thus, even for huge sample sizes, the high‐dimensional space is sparse. For example, the empirical rule states that about 68% of normally distributed data lies within one standard deviation of the mean. But this is for one dimension. For 10 independent dimensions, only about 2% of the data (0.68^10 ≈ 0.02) lies within one standard deviation of the mean in every dimension at once, and even less lies within the analogous hypersphere.
  4. Violates Parsimony. ...
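
The one‐standard‐deviation figures in the curse of dimensionality item can be checked numerically. The following is a minimal sketch, not code from the book: it assumes independent standard normal predictors, and the function names are hypothetical. It computes the fraction of data within one standard deviation of the mean in every coordinate (a hypercube) and, for comparison, the fraction within Euclidean distance one of the mean (a hypersphere), using the closed‐form chi‐square CDF that holds for an even number of dimensions.

```python
import math


def frac_within_one_sd_per_axis(d):
    """Fraction of points with every coordinate within one sd of the mean.

    Assumes d independent standard normal predictors (an illustrative
    assumption, not something stated in the text).
    """
    p_one_dim = math.erf(1 / math.sqrt(2))  # about 0.68, the empirical rule
    return p_one_dim ** d


def frac_within_unit_sphere(d):
    """Fraction of points within Euclidean distance 1 of the mean, even d only.

    The squared distance is chi-square with d degrees of freedom. For even
    d = 2m its CDF at x is 1 - exp(-x/2) * sum_{k<m} (x/2)^k / k!.
    """
    if d % 2 != 0:
        raise ValueError("this closed form is only used for even d")
    m = d // 2
    half_x = 0.5  # x / 2 with x = 1 (radius of one standard deviation)
    tail = math.exp(-half_x) * sum(
        half_x ** k / math.factorial(k) for k in range(m)
    )
    return 1 - tail


if __name__ == "__main__":
    for d in (1, 2, 10):
        print(f"d={d:2d}  within 1 sd on every axis: "
              f"{frac_within_one_sd_per_axis(d):.4f}")
    for d in (2, 10):
        print(f"d={d:2d}  within unit hypersphere:   "
              f"{frac_within_unit_sphere(d):.6f}")
```

Under these assumptions the per‐axis fraction is about 0.68 in one dimension and about 0.02 in ten, while the fraction inside the ten‐dimensional unit hypersphere is smaller still. Either way, the neighborhood around the mean empties out as dimensions are added.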

Publisher Resources

ISBN: 9781119526810