Correlation and linearity

For this task, we return to our old friend the caret package. We'll start by creating a correlation matrix, using the Spearman Rank method, then apply the findCorrelation() function for all correlations above 0.9:

df_corr <- cor(gettysburg_treated, method = "spearman")high_corr <- caret::findCorrelation(df_corr, cutoff = 0.9)
Why Spearman versus Pearson correlation? Spearman is free from any distribution assumptions and is robust enough for any task at hand: http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/.

The high_corr object is a list of integers that correspond to feature column numbers. Let's dig deeper into this:

high_corr

The output of the preceding code is as follows:

[1] 9 ...

Get Mastering Machine Learning with R - Third Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.