
Hands-On Unsupervised Learning Using Python

by Ankur A. Patel

March 2019

Intermediate to advanced

359 pages

8h 46m

English

O'Reilly Media, Inc.




Chapter 9. Semisupervised Learning

Until now, we have viewed supervised learning and unsupervised learning as two separate and distinct branches of machine learning. Supervised learning is appropriate when our dataset is labeled, and unsupervised learning is necessary when our dataset is unlabeled.

In the real world, the distinction is not quite so clear. Datasets are often only partially labeled, and we want to label the unlabeled observations efficiently while leveraging the information in the labeled set. With supervised learning alone, we would have to discard the majority of the dataset because it is unlabeled. With unsupervised learning alone, we would have the majority of the data to work with but no way to take advantage of the few labels we do have.

The field of semisupervised learning blends the benefits of both supervised and unsupervised learning, taking advantage of the few labels that are available to uncover structure in a dataset and help label the rest.
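To make the idea concrete, here is a minimal sketch of one simple semisupervised scheme: pseudo-labeling with a nearest-centroid rule. It is an illustration only, not the approach developed in this chapter. The data is synthetic (two well-separated blobs, not the credit card dataset), and the use of -1 to mark unlabeled observations, along with all variable names, is an assumption made for this example. A handful of known labels seed the class centroids, and the unlabeled points then refine those centroids over a few iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: NOT the credit card dataset).
n_per_class = 200
X = np.vstack([
    rng.normal(loc=[-3.0, -3.0], scale=1.0, size=(n_per_class, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n_per_class, 2)),
])
y_true = np.repeat([0, 1], n_per_class)

# Keep only 5 labels per class; -1 marks an unlabeled observation.
y_partial = np.full_like(y_true, -1)
labeled_idx = np.concatenate([
    rng.choice(n_per_class, size=5, replace=False),
    n_per_class + rng.choice(n_per_class, size=5, replace=False),
])
y_partial[labeled_idx] = y_true[labeled_idx]
is_labeled = y_partial != -1


def nearest_centroid(points, centroids):
    """Return the index of the closest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Step 1 (supervised part): centroids estimated from the few labeled points.
centroids = np.array([X[y_partial == k].mean(axis=0) for k in (0, 1)])

# Step 2 (unsupervised part): pseudo-label everything, then re-estimate the
# centroids from all points, repeating until the centroids stop moving.
for _ in range(10):
    y_pseudo = nearest_centroid(X, centroids)
    y_pseudo[is_labeled] = y_partial[is_labeled]  # known labels are never overwritten
    new_centroids = np.array([X[y_pseudo == k].mean(axis=0) for k in (0, 1)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

unlabeled_accuracy = (y_pseudo[~is_labeled] == y_true[~is_labeled]).mean()
print(f"Labeled points used: {is_labeled.sum()} of {len(y_true)}")
print(f"Accuracy on the originally unlabeled points: {unlabeled_accuracy:.3f}")
```

A purely supervised model here would see only the 10 labeled rows, and a purely unsupervised one would ignore them. The loop above uses the labels to decide which cluster is which and the unlabeled rows to sharpen the cluster estimates.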

We will continue to use the credit card transactions dataset in this chapter to showcase semisupervised learning.

Data Preparation

As before, let’s load in the necessary libraries and prepare the data. This should be pretty familiar by now:

'''Main'''
import numpy as np
import pandas as pd
import os, time, re
import pickle, gzip

'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl

%matplotlib inline

'''Data Prep and Model Evaluation'''
from sklearn ...