Spark MLlib provides built-in methods for regression. To use them, you first need to install pyspark on your cluster (standalone or distributed). It can be installed with pip:
pip install pyspark
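To confirm that the installation succeeded, a quick check from Python can help. The following is a minimal sketch using only the standard library; the helper name installed_pyspark_version is hypothetical and not part of PySpark.

```python
from importlib import metadata, util


def installed_pyspark_version():
    """Return the installed pyspark version string, or None if it is missing.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    # find_spec returns None when the package cannot be imported
    if util.find_spec("pyspark") is None:
        return None
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a manual install)
        return "unknown"


version = installed_pyspark_version()
if version is None:
    print("pyspark is not installed; run: pip install pyspark")
else:
    print(f"pyspark version: {version}")
```

Note that PySpark also needs a Java runtime on the machine, which this check does not verify.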
The MLlib library has the following regression methods:
- Linear regression: We covered linear regression in earlier chapters; in Spark it is available through the LinearRegression class defined in pyspark.ml.regression. By default, it minimizes the squared error, optionally with a regularization term. It supports L1 and L2 regularization, as well as a combination of the two (elastic net).
- Generalized linear regression: Spark MLlib supports a subset of exponential family distributions, such as Gaussian, Poisson, ...