Chapter 9. Imbalanced Classes

If you are classifying data, and the classes are not relatively balanced in size, the bias toward more popular classes can carry over into your model. For example, if you have 1 positive case and 99 negative cases, you can get 99% accuracy simply by classifying everything as negative. There are various options for dealing with imbalanced classes.

Use a Different Metric

A first step is to use a measure other than accuracy (AUC is a good choice) for evaluating models. Precision and recall are also better suited when the class sizes differ. However, there are other options to consider as well.
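
As a minimal sketch of the difference these metrics make, the following scores a classifier trained on 99% negative data with accuracy, precision, recall, and AUC (the synthetic dataset and logistic regression model are illustrative placeholders, not from the text):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (
        accuracy_score,
        precision_score,
        recall_score,
        roc_auc_score,
    )
    from sklearn.model_selection import train_test_split

    # Synthetic data with roughly 1% positive cases, mirroring the
    # 1-vs-99 example above
    X, y = make_classification(
        n_samples=10_000, weights=[0.99], flip_y=0, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42
    )

    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]

    print("accuracy :", accuracy_score(y_test, preds))  # misleadingly high
    print("precision:", precision_score(y_test, preds, zero_division=0))
    print("recall   :", recall_score(y_test, preds))
    print("AUC      :", roc_auc_score(y_test, probs))

Accuracy will look excellent even if the model misses most positive cases; recall and AUC expose that failure.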

Tree-based Algorithms and Ensembles

Tree-based models may perform better depending on the distribution of the smaller class. If the minority-class samples tend to cluster together, they can be classified more easily.

Ensemble methods can further aid in pulling out the minority classes. Bagging and boosting are used in tree-based models such as random forests and Extreme Gradient Boosting (XGBoost).
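
Here is a small sketch comparing a bagging ensemble (random forest) with a boosting ensemble (XGBoost) on the imbalanced split from the previous example; it assumes the xgboost package is installed and reuses X_train, X_test, y_train, and y_test:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    import xgboost as xgb

    # Bagging: a random forest averages many trees built on bootstrap samples
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    print("RF AUC :", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

    # Boosting: XGBoost adds trees that focus on the previous trees' errors
    xgb_clf = xgb.XGBClassifier(eval_metric="logloss", random_state=42)
    xgb_clf.fit(X_train, y_train)
    print("XGB AUC:", roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1]))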

Penalize Models

Many scikit-learn classification models support the class_weight parameter. Setting this to 'balanced' weights each class inversely proportional to its frequency, incentivizing the model to classify the minority classes correctly. Alternatively, you can grid search over weight options by passing in a dictionary mapping class to weight (give higher weight to smaller classes).
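
A minimal sketch of both approaches, continuing with the split from the earlier example (the weight grid is illustrative, not a recommendation):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # 'balanced' weights each class inversely proportional to its frequency
    balanced = LogisticRegression(class_weight="balanced", max_iter=1_000)
    balanced.fit(X_train, y_train)

    # Grid search explicit class-to-weight mappings, giving higher
    # weight to the smaller (positive) class
    grid = GridSearchCV(
        LogisticRegression(max_iter=1_000),
        param_grid={"class_weight": [{0: 1, 1: w} for w in (1, 5, 10, 50, 100)]},
        scoring="roc_auc",
        cv=5,
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_)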

The XGBoost library has the max_delta_step parameter, which can be set from 1 to 10 to make the update step more conservative. ...
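
A hedged sketch of setting that parameter, again reusing the earlier training split (the value of 1 is only a starting point to tune, and a recent xgboost version is assumed):

    import xgboost as xgb

    # max_delta_step=0 (the default) places no constraint on the update;
    # values of 1-10 make each boosting step more conservative
    conservative = xgb.XGBClassifier(
        max_delta_step=1,
        eval_metric="logloss",
        random_state=42,
    )
    conservative.fit(X_train, y_train)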
