Machine Learning Pocket Reference

by Matt Harrison

August 2019

Intermediate to advanced

318 pages

4h 40m

English

O'Reilly Media, Inc.

What to Expect; Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
Libraries Used; Installation with Pip; Installation with Conda
Project Layout Suggestion; Imports; Ask a Question; Terms for Data; Gather Data; Clean Data; Create Features; Sample Data; Impute Data; Normalize Data; Refactor; Baseline Model; Various Families; Stacking; Create Model; Evaluate Model; Optimize Model; Confusion Matrix; ROC Curve; Learning Curve; Deploy Model
Examining Missing Data; Dropping Missing Data; Imputing Data; Adding Indicator Columns
Column Names; Replacing Missing Values
Data Size; Summary Stats; Histogram; Scatter Plot; Joint Plot; Pair Grid; Box and Violin Plots; Comparing Two Ordinal Values; Correlation; RadViz; Parallel Coordinates
Standardize; Scale to Range; Dummy Variables; Label Encoder; Frequency Encoding; Pulling Categories from Strings; Other Categorical Encoding; Date Feature Engineering; Add col_na Feature; Manual Feature Engineering
Collinear Columns; Lasso Regression; Recursive Feature Elimination; Mutual Information; Principal Component Analysis; Feature Importance
Use a Different Metric; Tree-based Algorithms and Ensembles; Penalize Models; Upsampling Minority; Generate Minority Data; Downsampling Majority; Upsampling Then Downsampling

Logistic Regression; Naive Bayes; Support Vector Machine; K-Nearest Neighbor; Decision Tree; Random Forest; XGBoost; Gradient Boosted with LightGBM; TPOT
Validation Curve; Learning Curve
Confusion Matrix; Metrics; Accuracy; Recall; Precision; F1; Classification Report; ROC; Precision-Recall Curve; Cumulative Gains Plot; Lift Curve; Class Balance; Class Prediction Error; Discrimination Threshold
Regression Coefficients; Feature Importance; LIME; Tree Interpretation; Partial Dependence Plots; Surrogate Models; Shapley
Baseline Model; Linear Regression; SVMs; K-Nearest Neighbor; Decision Tree; Random Forest; XGBoost Regression; LightGBM Regression
Metrics; Residuals Plot; Heteroscedasticity; Normal Residuals; Prediction Error Plot
Shapley
PCA; UMAP; t-SNE; PHATE
K-Means; Agglomerative (Hierarchical) Clustering; Understanding Clusters
Classification Pipeline; Regression Pipeline; PCA Pipeline

Content preview from Machine Learning Pocket Reference

Chapter 19. Pipelines

Scikit-learn uses the notion of a pipeline. Using the Pipeline class, you can chain together transformers and models, and treat the whole process like a scikit-learn model. You can even insert custom logic.
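
As a minimal sketch of that idea, assuming synthetic data from make_classification rather than the book's Titanic dataset, a scaler and a classifier can be chained and then fit, scored, and used for prediction as a single estimator. The step names "std" and "lr" and the variable names are illustrative choices, not taken from the book:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Each step is a (name, estimator) pair; every step but the last must be
# a transformer, and the last is the model.
demo_pipe = Pipeline(
    [
        ("std", StandardScaler()),
        ("lr", LogisticRegression()),
    ]
)

# The pipeline behaves like one estimator: fit runs fit_transform on the
# scaler and then fits the model on the scaled training data.
demo_pipe.fit(X_train, y_train)
accuracy = demo_pipe.score(X_test, y_test)
predictions = demo_pipe.predict(X_test)
```

Because the scaler is fit only on the training split inside .fit, the test data does not influence the scaling parameters.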

Classification Pipeline

Here is an example using the tweak_titanic function inside of a pipeline:

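>>> import pandas as pd
>>> from sklearn.experimental import (
...     enable_iterative_imputer,
... )
>>> from sklearn import impute, preprocessing
>>> from sklearn.base import (
...     BaseEstimator,
...     TransformerMixin,
... )
>>> from sklearn.ensemble import (
...     RandomForestClassifier,
... )
>>> from sklearn.pipeline import Pipeline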

>>> def tweak_titanic(df):
...     df = df.drop(
...         columns=[
...             "name",
...             "ticket",
...             "home.dest",
...             "boat",
...             "body",
...             "cabin",
...         ]
...     ).pipe(pd.get_dummies, drop_first=True)
...     return df

>>> class TitanicTransformer(
...     BaseEstimator, TransformerMixin
... ):
...     def transform(self, X):
...         # assumes X is output
...         # from reading Excel file
...         X = tweak_titanic(X)
...         # drop the target column
...         X = X.drop(columns="survived")
...         return X
...
...     def fit(self, X, y=None):
...         # stateless: nothing to learn
...         return self

>>> pipe = Pipeline(
...     [
...         ("titan", TitanicTransformer()),
...         ("impute", impute.IterativeImputer()),
...         (
...             "std",
...             preprocessing.StandardScaler(),
...         ),
...         ("rf", RandomForestClassifier()),
...     ]
... )

With a pipeline in hand, we can call .fit and .score on it:

>>> from sklearn.model_selection import (
...     train_test_split,
... )
>>> X_train2, X_test2, y_train2, y_test2 = train_test_split(
...     orig_df,
...     orig_df.survived,
...     test_size=0.3,
...     random_state=42,
... )

>>> pipe.fit(X_train2, y_train2)
>>> pipe.score(X_test2 ...
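
The listing above depends on orig_df, which is loaded in an earlier chapter. As a self-contained sketch of the same custom-transformer pattern, the following uses a small hand-made DataFrame. The class name DropAndEncode, the toy data, and the column-reindexing step are all assumptions added for illustration; they are not the book's code:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class DropAndEncode(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drop the target, one-hot encode the rest."""

    def fit(self, X, y=None):
        # Remember the encoded column layout seen during training.
        self.columns_ = self._encode(X).columns
        return self

    def transform(self, X):
        # Reindex so new data gets the same columns as the training data.
        return self._encode(X).reindex(columns=self.columns_, fill_value=0)

    @staticmethod
    def _encode(X):
        return pd.get_dummies(X.drop(columns="survived"), drop_first=True)


# Made-up rows, loosely shaped like the Titanic data.
toy_df = pd.DataFrame(
    {
        "age": [22, 38, 26, 35, 54, 2, 27, 14],
        "sex": ["male", "female", "female", "female",
                "male", "male", "female", "female"],
        "survived": [0, 1, 1, 1, 0, 0, 1, 1],
    }
)

toy_pipe = Pipeline(
    [
        ("prep", DropAndEncode()),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
    ]
)

# The raw frame (target column included) goes in; the transformer removes it.
toy_pipe.fit(toy_df, toy_df.survived)
encoded = toy_pipe.named_steps["prep"].transform(toy_df)
train_accuracy = toy_pipe.score(toy_df, toy_df.survived)
```

Storing the training columns in fit is one way to keep pd.get_dummies from producing a different column set when a category is missing from later data. The score here is computed on the training rows only, so it says nothing about generalization.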