Chapter 23. Pearson Correlation
Introduction
The Pearson correlation measures how strongly two sets of data are linearly related. It is the most common measure of correlation in mathematics and statistics. In a nutshell, the Pearson correlation answers this question: how well can a straight line represent the relationship between the two sets of data? According to onlinestatbook:
The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. It is referred to as Pearson’s correlation or simply as the correlation coefficient. If the relationship between the variables is not linear, then the correlation coefficient does not adequately represent the strength of the relationship between the variables.
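To make the point about linearity concrete, here is a minimal sketch in plain Java (no Hadoop or Spark). The class name PearsonSketch and its method are hypothetical names chosen for illustration, and the code uses the standard mean-centered form of the coefficient. It compares a perfectly linear data set with a parabola, where y depends entirely on x but not linearly.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the chapter's own code.
class PearsonSketch {

    // Mean-centered form:
    // r = sum((x_i - mx)(y_i - my)) / sqrt(sum((x_i - mx)^2) * sum((y_i - my)^2))
    // Returns NaN when either input has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 values");
        }
        double mx = Arrays.stream(x).average().orElse(0.0);
        double my = Arrays.stream(y).average().orElse(0.0);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;   // co-variation of x and y
            sxx += dx * dx;   // variation of x
            syy += dy * dy;   // variation of y
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {-2, -1, 0, 1, 2};
        double[] linear = {-3, -1, 1, 3, 5};   // y = 2x + 1
        double[] parabola = {4, 1, 0, 1, 4};   // y = x^2
        System.out.println("linear:   " + pearson(x, linear));
        System.out.println("parabola: " + pearson(x, parabola));
    }
}
```

For the linear data the coefficient is 1, while for the symmetric parabola it is 0, even though y is completely determined by x. This is the situation the quotation describes: the coefficient does not capture the strength of a nonlinear relationship.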
This chapter will provide two MapReduce solutions for the Pearson correlation:
A simple solution using classical MapReduce/Hadoop
A Spark implementation that will correlate all vs. all (defined shortly)
The algorithms presented for the Pearson correlation can be easily adapted to the Spearman rank correlation. To perform a Spearman rank correlation, I have provided a Java wrapper class called Spearman.java. By the end of this chapter, you will be able to replace a Pearson correlation with Spearman's rank correlation algorithm.
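The reason the adaptation is straightforward is that the Spearman rank correlation can be computed as the Pearson correlation of the ranks of the values. The following is a rough sketch of that idea only. It is not the Spearman.java wrapper mentioned above, and the class name SpearmanSketch and its methods are hypothetical. Tied values are assumed to receive the average of the positions they occupy.

```java
import java.util.Arrays;

// Hypothetical illustration class; not the book's Spearman.java wrapper.
class SpearmanSketch {

    // Assigns 1-based ranks; tied values share the average of their positions.
    static double[] ranks(double[] v) {
        Integer[] idx = new Integer[v.length];
        for (int i = 0; i < v.length; i++) {
            idx[i] = i;
        }
        // Sort the indices by the values they point to.
        Arrays.sort(idx, (Integer a, Integer b) -> Double.compare(v[a], v[b]));
        double[] r = new double[v.length];
        int i = 0;
        while (i < idx.length) {
            int j = i;
            // Extend j over the run of values equal to v[idx[i]].
            while (j + 1 < idx.length && v[idx[j + 1]] == v[idx[i]]) {
                j++;
            }
            double avg = (i + j) / 2.0 + 1.0;  // average of positions i+1 .. j+1
            for (int k = i; k <= j; k++) {
                r[idx[k]] = avg;
            }
            i = j + 1;
        }
        return r;
    }

    // Mean-centered Pearson correlation; returns NaN for zero-variance input.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 values");
        }
        double mx = Arrays.stream(x).average().orElse(0.0);
        double my = Arrays.stream(y).average().orElse(0.0);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Spearman = Pearson applied to the ranks instead of the raw values.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] cubes = {1, 8, 27, 64, 125};  // monotonic but not linear in x
        System.out.println("pearson:  " + pearson(x, cubes));
        System.out.println("spearman: " + spearman(x, cubes));
    }
}
```

With this structure, switching from Pearson to Spearman only changes the preprocessing of the two input vectors; the correlation step itself stays the same. For the cubic data above, Spearman reports a perfect monotonic relationship, while Pearson falls short of 1 because the relationship is not a straight line.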
Pearson Correlation Formula
The formula for the Pearson correlation can be written in many different equivalent forms. Let x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn). Then the Pearson correlation ...