
Machine Learning Pocket Reference

by Matt Harrison

August 2019

Intermediate to advanced

318 pages

4h 40m

English

O'Reilly Media, Inc.



Content preview from Machine Learning Pocket Reference

Chapter 14. Regression

Regression is a supervised machine learning process. It is similar to classification, but rather than predicting a label, we try to predict a continuous value. If you are trying to predict a number, then use regression.

sklearn supports regression versions of many of the same models used for classification. The API is also the same: you call .fit, .score, and .predict. This holds for the next-generation boosting libraries, XGBoost and LightGBM, as well.
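To make the shared API concrete, here is a minimal sketch. It assumes synthetic data from `make_regression` and a plain `LinearRegression` model; neither is taken from this chapter's listings, and any sklearn regressor could be swapped in.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: an assumption for illustration, not the book's dataset
X, y = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)        # learn from the training split
r2 = lr.score(X_test, y_test)   # for regressors, .score returns R^2
preds = lr.predict(X_test)      # continuous values rather than labels
```

The calls are the same ones a classifier would use; the differences are that `.predict` returns floating-point values and `.score` reports the coefficient of determination instead of accuracy.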

Though there are similarities with the classification models and hyperparameters, the evaluation metrics are different for regression. This chapter will review many of the types of regression models. We will use the Boston housing dataset to explore them.
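As a rough illustration of how regression is scored differently, the sketch below computes a few common regression metrics from `sklearn.metrics`. The target and prediction arrays are made-up values chosen for easy arithmetic, not results from the Boston data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_pred)  # 0.5
mse = metrics.mean_squared_error(y_true, y_pred)   # 0.375
rmse = np.sqrt(mse)                                # same units as the target
r2 = metrics.r2_score(y_true, y_pred)              # 1.0 is a perfect fit
```

Unlike accuracy or recall, these metrics measure how far predictions fall from the true values rather than how often a label is correct.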

Here we load the data, create a split version for training and testing, and create another split version with standardized data:

>>> import pandas as pd
>>> # load_boston was removed in scikit-learn 1.2;
>>> # this listing needs an earlier release
>>> from sklearn.datasets import load_boston
>>> from sklearn import (
...     model_selection,
...     preprocessing,
... )
>>> b = load_boston()
>>> bos_X = pd.DataFrame(
...     b.data, columns=b.feature_names
... )
>>> bos_y = b.target

>>> bos_X_train, bos_X_test, bos_y_train, bos_y_test = model_selection.train_test_split(
...     bos_X,
...     bos_y,
...     test_size=0.3,
...     random_state=42,
... )


>>> bos_sX = preprocessing.StandardScaler().fit_transform(
...     bos_X
... )
>>> bos_sX_train, bos_sX_test, bos_sy_train, bos_sy_test = model_selection.train_test_split(
...     bos_sX,
...     bos_y,
...     test_size ...
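
The listing above standardizes the full dataset before splitting it. A common variation is to split first and fit the scaler on the training rows only, so that no statistics from the test rows influence the scaling. The sketch below shows that ordering under stated assumptions: it uses synthetic data from `make_regression` in place of the Boston data, and the variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption)
X, y = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_train)
sX_train = scaler.transform(X_train)
# Reuse the training mean and standard deviation for the test split
sX_test = scaler.transform(X_test)
```

With this ordering the training columns have zero mean and unit variance, while the test columns are only approximately standardized because they reuse the training statistics.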