Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.

Chapter 27. Linear Regression

This chapter presents an important statistical concept, linear regression, which has many uses, including clinical applications such as genome analysis using patient sample data. According to Wikipedia: “Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.” Implementing linear regression for small data is straightforward: we can use existing Java classes, such as SimpleRegression from Apache Commons. However, these classes and packages cannot handle huge amounts of data, because a single server has limited memory and CPU resources. Our primary goal in this chapter is to implement linear regression for huge data sets (such as genomic data represented by biosets for many patients’ sample data).

This chapter provides two distinct MapReduce/Hadoop solutions for linear regression:

  • The first solution utilizes Apache Commons’s SimpleRegression.

  • The second solution uses R’s linear model inside a MapReduce job.
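
To give a feel for why simple linear regression fits the MapReduce model at all, here is a minimal sketch in plain Java. It is not either of the two solutions listed above: it assumes a hypothetical class, RegressionSums, that keeps five running totals per partition and merges them in a reduce-style step. All class and method names are illustrative only.

```java
// Illustrative sketch only: RegressionSums is a hypothetical helper,
// not a class from Hadoop, Apache Commons, or the book.
public class RegressionSums {
    // Running totals that are sufficient to fit y = intercept + slope * x.
    private long n;
    private double sumX, sumY, sumXX, sumXY;

    // "Map" side: fold one (x, y) observation into the totals.
    public void add(double x, double y) {
        n++;
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    // "Reduce" side: absorb the totals computed on another partition.
    public RegressionSums merge(RegressionSums other) {
        n += other.n;
        sumX += other.sumX;
        sumY += other.sumY;
        sumXX += other.sumXX;
        sumXY += other.sumXY;
        return this;
    }

    // Least squares slope computed from the merged totals.
    public double slope() {
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    // Least squares intercept computed from the merged totals.
    public double intercept() {
        return (sumY - slope() * sumX) / n;
    }

    public static void main(String[] args) {
        // Two partitions of points that all lie on y = 2x + 1.
        RegressionSums part1 = new RegressionSums();
        part1.add(1, 3);
        part1.add(2, 5);

        RegressionSums part2 = new RegressionSums();
        part2.add(3, 7);
        part2.add(4, 9);

        RegressionSums total = part1.merge(part2);
        System.out.println("slope=" + total.slope()
                + " intercept=" + total.intercept());
    }
}
```

Because the totals simply add up, each partition can be processed independently and only five numbers per partition need to reach the reducer. Note that this raw-sums formula can lose precision when the x values are large and close together, so a production implementation would likely use a more numerically stable update.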

Spark provides the Machine Learning Library package, or MLlib, which includes linear methods (MLlib is under active development).

The most common form of linear regression is least squares fitting. Before getting into the details of implementing linear regression, let’s define what it is and what it tells us. In simple terms, we are trying to fit an equation to a real set of ...
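
For reference, the standard least squares fit for a single predictor can be written as follows. The symbols are assumptions introduced here, not notation taken from the chapter: n observations (x_i, y_i), intercept a, slope b, and sample means \bar{x} and \bar{y}.

```latex
% Choose a and b to minimize the sum of squared residuals.
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2

% Closed-form solution for the slope and intercept.
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```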

ISBN: 9781491906170