
Data Algorithms

by Mahmoud Parsian
July 2015
Content level: Intermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.

Chapter 23. Pearson Correlation

Introduction

The Pearson correlation measures how strongly two sets of data are linearly related. It is the most common measure of correlation in mathematics and statistics. In a nutshell, the Pearson correlation answers this question: how well can a straight line represent the relationship between the two sets of data? According to onlinestatbook:

The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. It is referred to as Pearson’s correlation or simply as the correlation coefficient. If the relationship between the variables is not linear, then the correlation coefficient does not adequately represent the strength of the relationship between the variables.

This chapter will provide two MapReduce solutions for the Pearson correlation:

  • A simple solution using classical MapReduce/Hadoop

  • A Spark implementation that will correlate all vs. all (defined shortly)

The algorithms presented for the Pearson correlation can be easily adapted to Spearman's rank correlation. To perform a Spearman rank correlation, I have provided a Java wrapper class called Spearman.java. By the end of this chapter, you will be able to replace a Pearson correlation with Spearman's rank correlation algorithm.
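Spearman's rank correlation is the Pearson correlation computed on the ranks of the values rather than on the values themselves. The following is a minimal sketch of that idea, not the Spearman.java wrapper mentioned above. The class and method names (SpearmanSketch, ranks, pearson, spearman) are hypothetical, and the sketch assumes that tied values receive the average of the ranks they span.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not the book's Spearman.java.
final class SpearmanSketch {
    private SpearmanSketch() {}

    // Returns 1-based ranks; tied values share the average of their ranks (an assumption).
    static double[] ranks(double[] v) {
        int n = v.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        // Sort the indices by the values they point to.
        Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));

        double[] r = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            // Extend j over the run of values equal to v[idx[i]].
            while (j + 1 < n && v[idx[j + 1]] == v[idx[i]]) {
                j++;
            }
            double avg = (i + j) / 2.0 + 1.0; // mean of the 1-based ranks i+1 .. j+1
            for (int k = i; k <= j; k++) {
                r[idx[k]] = avg;
            }
            i = j + 1;
        }
        return r;
    }

    // Pearson correlation in its mean-centered form; NaN if either input is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Spearman = Pearson applied to the ranks.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {1, 4, 9, 16, 25}; // monotonic but not linear in x
        System.out.println(spearman(x, y)); // prints 1.0
    }
}
```

Because only the ranking step differs, any Pearson implementation can be reused unchanged once its inputs are replaced by ranks.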

Pearson Correlation Formula

The formula for the Pearson correlation can be written in many different equivalent forms. Let x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n). Then the Pearson correlation ...
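For reference, one widely used form of the coefficient is the mean-centered one: r = sum((x_i - mean(x)) * (y_i - mean(y))) / sqrt(sum((x_i - mean(x))^2) * sum((y_i - mean(y))^2)). It may differ from the form the chapter goes on to use. The following is a minimal single-machine Java sketch of that form, not a MapReduce or Spark solution. The class and method names (PearsonSketch, pearson) are hypothetical.

```java
// Hypothetical illustration of the mean-centered form of the Pearson correlation.
final class PearsonSketch {
    private PearsonSketch() {}

    // Returns r in [-1, 1]; the result is NaN if either input is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;

        // First pass: the two means.
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Second pass: sums of centered products and centered squares.
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Perfectly linear data: y = 2x.
        double[] x1 = {1, 2, 3, 4, 5};
        double[] y1 = {2, 4, 6, 8, 10};
        System.out.println(pearson(x1, y1)); // prints 1.0

        // Strong but nonlinear relationship: y = x^2 on symmetric x.
        double[] x2 = {-2, -1, 0, 1, 2};
        double[] y2 = {4, 1, 0, 1, 4};
        System.out.println(pearson(x2, y2)); // prints 0.0
    }
}
```

The second case in main illustrates the caveat quoted earlier: y is fully determined by x, yet the coefficient is zero because the relationship is not linear.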

Publisher Resources

ISBN: 9781491906170