
Machine Learning Pocket Reference

by Matt Harrison

August 2019

Intermediate to advanced

318 pages

4h 40m

English

O'Reilly Media, Inc.



Content preview from Machine Learning Pocket Reference

Chapter 7. Preprocess Data

This chapter will explore common preprocessing steps using this data:

>>> import pandas as pd
>>> X2 = pd.DataFrame(
...     {
...         "a": range(5),
...         "b": [-100, -50, 0, 200, 1000],
...     }
... )
>>> X2
   a     b
0  0  -100
1  1   -50
2  2     0
3  3   200
4  4  1000

Standardize

Some algorithms, such as SVM, perform better when the data is standardized, meaning each column is rescaled to have a mean of 0 and a standard deviation of 1. Sklearn's StandardScaler provides a .fit_transform method that combines .fit and .transform:

>>> from sklearn import preprocessing
>>> std = preprocessing.StandardScaler()
>>> std.fit_transform(X2)
array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])

After fitting, there are various attributes we can inspect:

>>> std.scale_
array([  1.41421356, 407.92156109])
>>> std.mean_
array([  2., 210.])
>>> std.var_
array([2.000e+00, 1.664e+05])
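
The fitted attributes relate to each other in a simple way. As a minimal sketch (the names `manual` and `new_sample` are illustrative and not from the book), the following recomputes the transform by hand from `mean_` and `scale_`, then reuses the fitted scaler on an unseen sample:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

std = preprocessing.StandardScaler()
scaled = std.fit_transform(X2)

# scale_ is the square root of var_ (population variance, ddof=0)
scale_matches_var = np.allclose(std.scale_, np.sqrt(std.var_))

# The transform is (x - mean_) / scale_, applied column by column
manual = (X2.to_numpy() - std.mean_) / std.scale_
manual_matches = np.allclose(manual, scaled)

# Hypothetical unseen sample: call .transform (not .fit_transform)
# so the statistics learned from X2 are reused rather than refit
new_sample = pd.DataFrame({"a": [2], "b": [210]})
new_scaled = std.transform(new_sample)
```

Because the hypothetical sample sits exactly at the column means of X2, it should map to zeros in both columns.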

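Here is a pandas version. Its values differ slightly from the StandardScaler output because the pandas .std method defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population standard deviation (ddof=0). Remember that you will need to track the original mean and standard deviation if you use this for preprocessing. Any sample that you later use for prediction must be standardized with those same values: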

>>> X_std = (X2 - X2.mean()) / X2.std()
>>> X_std
          a         b
0 -1.264911 -0.679720
1 -0.632456 -0.570088
2  0.000000 -0.460455
3  0.632456 -0.021926
4  1.264911  1.732190

>>> X_std.mean()
a    4.440892e-17
b    0.000000e+00
dtype: float64

>>> X_std.std()
a    1.0
b    1.0
dtype: float64
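
The following is a sketch of one way to keep the training statistics around and apply them to later data with pandas. It passes `ddof=0` to `.std` so that the result lines up with StandardScaler. The variable names (`train_mean`, `train_sd`, `new_sample`) are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X2 = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Store the statistics computed on the original (training) data
train_mean = X2.mean()
train_sd = X2.std(ddof=0)  # population std, as StandardScaler uses

X_std0 = (X2 - train_mean) / train_sd

# With ddof=0 the pandas result agrees with sklearn's output
sk_scaled = preprocessing.StandardScaler().fit_transform(X2)
matches_sklearn = np.allclose(X_std0.to_numpy(), sk_scaled)

# Hypothetical later sample: standardize with the stored statistics,
# never with the mean and std of the new sample itself
new_sample = pd.DataFrame({"a": [4], "b": [1000]})
new_std = (new_sample - train_mean) / train_sd
```

Subtracting a Series from a DataFrame aligns on column labels, so the stored statistics are applied to the matching columns even if the new frame lists them in a different order.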

The fastai library also implements this:

>>> X3 = X2.copy()
>>> from fastai.structured ...


ISBN: 9781492047537