Chapter 6. Principal Component Analysis
Principal component analysis, or PCA, is one of the minor miracles of machine learning. It’s a dimensionality reduction technique that reduces the number of dimensions in a dataset without sacrificing a commensurate amount of information. While that might seem underwhelming on the face of it, it has profound implications for engineers and software developers working to build predictive models from their data.
What if I told you that you could take a dataset with 1,000 columns, use PCA to reduce it to 100 columns, and retain 90% or more of the information in the original dataset? That’s relatively common, believe it or not. And it lends itself to a variety of practical uses, including:
- Reducing high-dimensional data to two or three dimensions so that it can be plotted and explored
- Reducing the number of dimensions in a dataset and then restoring the original number of dimensions, which finds application in anomaly detection and noise filtering
- Anonymizing datasets so that they can be shared with others without revealing the nature or meaning of the data
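To make the reduce-then-restore idea concrete, here is a minimal sketch that performs PCA by hand with NumPy's singular value decomposition. It is an illustration under stated assumptions, not the chapter's own code: the dataset is synthetic (50 columns driven by 5 hidden factors), the variable names are made up for the example, and "information" is measured as the fraction of variance retained. In practice you would more likely reach for a library implementation such as scikit-learn's `PCA` class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 500 rows, 50 columns,
# generated from only 5 underlying factors plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# Center each column, then take the SVD; the rows of Vt are the principal axes.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of the total variance carried by each component, and its running sum.
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

# Smallest number of components that retains at least 90% of the variance.
k = int(np.searchsorted(cumulative, 0.90) + 1)

# Reduce to k columns, then restore the original number of columns.
X_reduced = Xc @ Vt[:k].T
X_restored = X_reduced @ Vt[:k] + mean

# Relative reconstruction error over the whole dataset.
rel_error = np.linalg.norm(X - X_restored) / np.linalg.norm(X)

print("components kept:", k)
print("reduced shape:", X_reduced.shape, "restored shape:", X_restored.shape)
print("variance retained:", float(cumulative[k - 1]))
print("relative reconstruction error:", float(rel_error))
```

The restored data has the original shape but is missing whatever the discarded components carried. That loss is what makes the round trip useful: rows that reconstruct poorly are candidates for anomalies, and the variation that gets dropped is often mostly noise.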
And that’s not all. A side effect of applying PCA to a dataset is that less important features, meaning columns of data that have less relevance to the outcome of a predictive model, are removed, while dependencies between columns are eliminated. And in datasets with a low ratio of samples (rows) to features (columns), PCA can be used to increase that ratio. As a rule of thumb, you typically ...
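The point about dependencies between columns can be checked numerically. The sketch below assumes a small made-up dataset in which the second column is nearly a multiple of the first, and again uses a hand-rolled NumPy SVD rather than a library PCA class. It shows that the columns are strongly correlated before the transform and that their covariances are essentially zero afterward. Note that this covers linear dependencies only; nonlinear relationships between columns can survive the transform.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: three columns, the second strongly correlated with the first.
a = rng.normal(size=1000)
X = np.column_stack([
    a,
    2.0 * a + 0.3 * rng.normal(size=1000),
    rng.normal(size=1000),
])

# Center the columns and project them onto the principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T  # the same rows, expressed in principal-component coordinates

# Correlation between the original columns.
corr_before = np.corrcoef(X, rowvar=False)

# Covariance between the transformed columns; off-diagonal entries should vanish.
cov_after = np.cov(Z, rowvar=False)
off_diag = cov_after - np.diag(np.diag(cov_after))

print("correlation of columns 0 and 1 before PCA:", float(corr_before[0, 1]))
print("largest off-diagonal covariance after PCA:", float(np.max(np.abs(off_diag))))
```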