Chapter 17. Dimensionality Reduction

There are many techniques to decompose features into a smaller subset. This can be useful for exploratory data analysis, visualization, making predictive models, or clustering.

In this chapter we will explore the Titanic dataset using various techniques. We will look at PCA, UMAP, t-SNE, and PHATE.

Here is the data:

>>> ti_df = tweak_titanic(orig_df)
>>> std_cols = "pclass,age,sibsp,fare".split(",")
>>> X_train, X_test, y_train, y_test = get_train_test_X_y(
...     ti_df, "survived", std_cols=std_cols
... )
>>> X = pd.concat([X_train, X_test])
>>> y = pd.concat([y_train, y_test])

PCA

Principal Component Analysis (PCA) takes a matrix (X) of rows (samples) and columns (features). PCA returns a new matrix that has columns that are linear combinations of the original columns. These linear combinations maximize the variance.

Each column is orthogonal (a right angle) to the other columns. The columns are sorted in order of decreasing variance.

Scikit-learn has an implementation of this model. It is best to standardize the data prior to running the algorithm. After calling the .fit method, you will have access to an .explained_variance_ratio_ attribute that lists the percentage of variance in each column.

PCA is useful to visualize data in two (or three) dimensions. It is also used as a preprocessing step to filter out random noise in data. It is good for finding global structures, but not local ones, and works well with linear data.

In this example, ...

Get Machine Learning Pocket Reference now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.