Hands-On Unsupervised Learning Using Python

Book Description

Many industry experts consider unsupervised learning the next AI frontier, one that may hold the key to general artificial intelligence. Armed with the conceptual knowledge in this book, data scientists and machine learning practitioners will learn hands-on how to apply unsupervised learning to large unlabeled datasets using Python tools. You’ll uncover hidden patterns, gain deeper business insight, detect anomalies, perform automatic feature engineering and selection, and generate synthetic datasets.

Author Ankur Patel—an applied machine-learning researcher and data scientist with expertise in financial markets—provides the concepts, intuition, and tools necessary for you to apply this technology to problems you tackle every day. Through the course of this book, you’ll learn how to build production-ready systems with Python.

Chapters for this Early Release edition will be published as they are completed; this initial release includes the first five chapters. With this book, you'll:

  • Examine the difference between supervised and unsupervised learning, and the relative strengths and weaknesses of each
  • Set up and manage a machine learning project end-to-end—everything from data acquisition to building a model and implementing a solution in production
  • Explore dimensionality reduction algorithms that learn the underlying structure of a dataset’s most salient information
  • Build a credit card fraud detection system using dimensionality reduction methods

Table of Contents

  1. Preface
    1. A brief history of machine learning
    2. AI is back, but why now?
    3. The emergence of applied AI
    4. Major milestones in applied AI over the past 20 years
    5. From narrow AI to general AI
    6. Objective and approach
    7. Prerequisites
    8. Roadmap
    9. Other Resources
    10. Conventions Used in This Book
    11. Using Code Examples
    12. O’Reilly Safari
    13. How to Contact Us
    14. Acknowledgments
  2. 1. Fundamentals of Unsupervised Learning
  3. 2. Unsupervised Learning in the Machine Learning Ecosystem
    1. Basic machine learning terminology
    2. Rules-based versus machine learning
    3. Supervised versus unsupervised
      1. The strengths and weaknesses of supervised learning
      2. The strengths and weaknesses of unsupervised learning
    4. Using unsupervised learning to improve machine learning solutions
    5. A closer look at supervised algorithms
      1. Linear methods
      2. Neighborhood-based methods
      3. Tree-based methods
      4. Support vector machines
      5. Neural networks
    6. A closer look at unsupervised algorithms
      1. Dimensionality reduction
      2. Clustering
      3. Feature extraction
      4. Unsupervised deep learning
      5. Sequential data problems using unsupervised learning
    7. Reinforcement learning using unsupervised learning
    8. Semi-supervised learning
    9. Successful applications of unsupervised learning
    10. Conclusion
  4. 3. End-to-End Machine Learning Project
    1. Environment setup
      1. Version control: Git
      2. Scientific libraries: Anaconda distribution of Python
      3. Neural networks: TensorFlow
      4. Gradient boosting, version one: XGBoost
      5. Gradient boosting, version two: LightGBM
      6. Interactive computing environment: Jupyter Notebook
    2. Overview of the data
    3. Data preparation
      1. Data acquisition
      2. Data exploration
      3. Generate feature matrix and labels array
      4. Feature engineering and feature selection
      5. Data visualization
    4. Model preparation
      1. Split into training and test sets
      2. Select cost function
      3. Create k-fold cross validation sets
    5. Machine learning models - Part One
      1. Model One - Logistic Regression
    6. Evaluation metrics
      1. Confusion matrix
      2. Precision-recall curve
      3. Receiver operating characteristic
    7. Machine learning models - Part Two
      1. Model Two - Random Forests
      2. Model Three - Gradient Boosting Machine (XGBoost)
      3. Model Four - Gradient Boosting Machine (LightGBM)
    8. Evaluation of the four models using the test set
    9. Ensembles
      1. Stacking
    10. Final model selection
    11. Production pipeline
    12. Conclusion
  5. 4. Unsupervised Learning using Scikit-Learn
  6. 5. Dimensionality Reduction
    1. The motivation for dimensionality reduction
      1. The MNIST digits database
    2. Dimensionality reduction algorithms
      1. Linear projection versus manifold learning
    3. Principal component analysis (PCA)
      1. PCA the concept
      2. PCA in practice
      3. Incremental PCA
      4. Sparse PCA
      5. Kernel PCA
    4. Singular value decomposition
    5. Random projection
      1. Gaussian random projection
      2. Sparse random projection
    6. Linear discriminant analysis (LDA)
    7. Isomap
    8. Multidimensional scaling (MDS)
    9. Locally linear embedding
    10. t-distributed stochastic neighbor embedding (t-SNE)
    11. Other dimensionality reduction methods
    12. Dictionary learning
    13. Independent component analysis
    14. Conclusion
  7. 6. Anomaly Detection
    1. Credit card fraud detection
      1. Prepare the data
      2. Define anomaly score function
      3. Define evaluation metrics
    2. Normal PCA anomaly detection
      1. PCA components equal number of original dimensions
      2. Search for the optimal number of principal components
    3. Sparse PCA anomaly detection
    4. Kernel PCA anomaly detection
    5. Gaussian random projection anomaly detection
    6. Sparse random projection anomaly detection
    7. Non-linear anomaly detection
    8. Dictionary learning anomaly detection
    9. Independent component analysis anomaly detection
    10. Fraud Detection on the Test Set
      1. Normal PCA anomaly detection on the test set
      2. Independent component analysis anomaly detection on the test set
      3. Dictionary learning anomaly detection on the test set
    11. Unsupervised Learning Anomaly Detection
  8. 7. Clustering
    1. MNIST Digits Dataset
      1. Data preparation
    2. Clustering algorithms
    3. K-means
      1. K-means inertia
      2. Evaluating the clustering results
      3. K-means accuracy
      4. K-means and the number of principal components
      5. K-means on the original dataset
    4. Hierarchical clustering
      1. Agglomerative hierarchical clustering
      2. The dendrogram
      3. Evaluating the clustering results
    5. DBSCAN
      1. DBSCAN algorithm
      2. Applying DBSCAN to our dataset
      3. HDBSCAN
    6. Conclusion
  9. 8. Group Segmentation
    1. Lending Club Data
      1. Data preparation
    2. Goodness of the Clusters
    3. K-means Application
    4. Hierarchical Clustering Application
    5. DBSCAN Application
    6. Conclusion
  10. 9. Unsupervised Learning using TensorFlow and Keras
  11. 10. Autoencoders
    1. Neural networks
    2. TensorFlow
      1. TensorFlow Example
    3. Keras
    4. Encoder and decoder
    5. Undercomplete autoencoders
    6. Overcomplete autoencoders
    7. Dense versus sparse autoencoders
    8. Denoising autoencoder (DAE)
    9. Variational autoencoder (VAE)
    10. Conclusion
  12. 11. Hands-on Autoencoder
    1. Data Preparation
    2. A basic autoencoder
      1. Choosing the right activation function
    3. Two layer complete autoencoder with linear activation function
    4. Two layer undercomplete autoencoder with linear activation function
      1. Increasing the number of nodes
      2. Adding more hidden layers
    5. Non-Linear Autoencoder
    6. Overcomplete Autoencoder with Linear Activation
    7. Overcomplete Autoencoder with Linear Activation and Dropout
    8. Sparse Overcomplete Autoencoder with Linear Activation
    9. Sparse Overcomplete Autoencoder with Linear Activation and Dropout
    10. Working with Noisy Datasets
    11. Denoising autoencoder
      1. Two-layer Denoising Undercomplete Autoencoder with Linear Activation
      2. Two-layer Denoising Overcomplete Autoencoder with Linear Activation
      3. Two-layer Denoising Overcomplete Autoencoder with ReLU Activation
    12. Variational Autoencoders
    13. Conclusion
  13. 12. Semi-supervised Learning
    1. Data preparation
    2. Supervised model
    3. Unsupervised model
    4. Semi-supervised model
    5. The power of supervised and unsupervised
    6. Conclusion
  14. 13. Deep Unsupervised Learning using TensorFlow and Keras
  15. 14. Recommender Systems using Restricted Boltzmann Machines
    1. Boltzmann Machines
      1. Restricted Boltzmann Machines
    2. Recommender Systems
      1. Collaborative Filtering
      2. The Netflix Prize
    3. MovieLens Dataset
      1. Data Preparation
      2. Define the Cost Function: Mean Squared Error
    4. Matrix Factorization
      1. One Latent Factor
      2. Three Latent Factors
      3. Five Latent Factors
    5. Collaborative Filtering using RBMs
      1. RBM Neural Network Architecture
      2. Set up RBM Recommender System
      3. Train RBM Recommender System
    6. Conclusion