This chapter presents a very important statistical concept, linear regression,1 which has many uses, including clinical applications such as genome analysis using patient sample data. According to Wikipedia: “Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.” Implementing linear regression for small data is very straightforward: we can use many existing Java classes, such as
SimpleRegression from Apache Commons.2 However, these classes and packages can not handle a huge amount of data due to the limited memory and CPU resources in a single server. Our primary goal in this chapter is to implement linear regression for huge data sets (such as genomic data represented by biosets for many patients’ sample data).
This chapter provides two distinct MapReduce/Hadoop solutions for linear regression:
The first solution utilizes Apache Commons’s
The second solution implements MapReduce by using R’s linear model.
Spark provides the Machine Learning Library package, or MLlib, which includes linear methods (MLlib is under active development).
The most common form of linear regression is least squares fitting. Before getting into the details of implementing linear regression, let’s define what it is and what it tells us. In simple terms, we are trying to fit an equation to a real set of ...