Skip to Content
Machine Learning Pocket Reference
book

Machine Learning Pocket Reference

by Matt Harrison
August 2019
Intermediate to advanced
318 pages
4h 40m
English
O'Reilly Media, Inc.
Book available
Content preview from Machine Learning Pocket Reference

Chapter 17. Dimensionality Reduction

There are many techniques to decompose features into a smaller subset. This can be useful for exploratory data analysis, visualization, making predictive models, or clustering.

In this chapter we will explore the Titanic dataset using various techniques. We will look at PCA, UMAP, t-SNE, and PHATE.

Here is the data:

>>> ti_df = tweak_titanic(orig_df)
>>> std_cols = "pclass,age,sibsp,fare".split(",")
>>> X_train, X_test, y_train, y_test = get_train_test_X_y(
...     ti_df, "survived", std_cols=std_cols
... )
>>> X = pd.concat([X_train, X_test])
>>> y = pd.concat([y_train, y_test])

PCA

Principal Component Analysis (PCA) takes a matrix (X) of rows (samples) and columns (features). PCA returns a new matrix that has columns that are linear combinations of the original columns. These linear combinations maximize the variance.

Each column is orthogonal (a right angle) to the other columns. The columns are sorted in order of decreasing variance.

Scikit-learn has an implementation of this model. It is best to standardize the data prior to running the algorithm. After calling the .fit method, you will have access to an .explained_variance_ratio_ attribute that lists the percentage of variance in each column.

PCA is useful to visualize data in two (or three) dimensions. It is also used as a preprocessing step to filter out random noise in data. It is good for finding global structures, but not local ones, and works well with linear data.

In this example, ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Practical Simulations for Machine Learning

Practical Simulations for Machine Learning

Paris Buttfield-Addison, Mars Buttfield-Addison, Tim Nugent, Jon Manning

Publisher Resources

ISBN: 9781492047537Errata PageSupplemental Content