
Applied Unsupervised Learning with Python

Book Description

Design clever algorithms that can uncover interesting structures and hidden relationships in unstructured, unlabeled data

Key Features

  • Learn how to select the most suitable Python library to solve your problem
  • Compare k-Nearest Neighbor (k-NN) and other non-parametric methods and decide when to use them
  • Delve into the applications of neural networks using real-world datasets

Unsupervised learning is a useful and practical solution in situations where labeled data is not available.

Applied Unsupervised Learning with Python guides you through best practices for applying unsupervised learning techniques, in tandem with Python libraries, to extract meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you'll learn what dimensionality reduction is and where to apply it. As you progress, you'll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also learn how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. You will complete the book by challenging yourself with interesting activities such as performing a Market Basket Analysis and identifying relationships between different items of merchandise.
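
To give a flavor of the clustering material described above, here is a minimal sketch (not taken from the book; the toy data is made up for illustration) of grouping similar two-dimensional points with scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points (toy data)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

# Ask k-means for two clusters; random_state makes the run repeatable
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# The first three points share one cluster label, the last three the other
print(model.labels_)
```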

By the end of this book, you will have the skills you need to confidently build your own models using Python.

What you will learn

  • Understand the basics and importance of clustering
  • Build k-means, hierarchical, and DBSCAN clustering algorithms both from scratch and with built-in packages
  • Explore dimensionality reduction and its applications
  • Use scikit-learn (sklearn) to implement and analyze principal component analysis (PCA) on the Iris dataset
  • Employ Keras to build autoencoder models for the CIFAR-10 dataset
  • Apply the Apriori algorithm with machine learning extensions (Mlxtend) to study transaction data
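
One of the bullets above mentions implementing PCA on the Iris dataset with scikit-learn. A minimal sketch of what that looks like (this is not the book's code, and the choice of two components is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the classic Iris dataset: 150 samples, 4 features
X = load_iris().data

# Project the 4-dimensional measurements onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```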

Who this book is for

This book is designed for developers, data scientists, and machine learning enthusiasts who are interested in unsupervised learning. Some familiarity with Python programming, along with basic knowledge of mathematical concepts including exponents, square roots, means, and medians, will be beneficial.

Downloading the example code for this ebook: You can download the example code files for this ebook from GitHub at the following link: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python. If you require support, please email customercare@packt.com.

Table of Contents

  1. Preface
    1. About the Book
      1. About the Authors
      2. Learning Objectives
      3. Audience
      4. Approach
      5. Hardware Requirements
      6. Software Requirements
      7. Conventions
      8. Installation and Setup
      9. Install Anaconda on Windows
      10. Install Anaconda on Linux
      11. Install Anaconda on macOS
      12. Install Python on Windows
      13. Install Python on Linux
      14. Install Python on macOS X
      15. Additional Resources
  2. Chapter 1: Introduction to Clustering
    1. Introduction
    2. Unsupervised Learning versus Supervised Learning
    3. Clustering
      1. Identifying Clusters
      2. Two-Dimensional Data
      3. Exercise 1: Identifying Clusters in Data
    4. Introduction to k-means Clustering
      1. No-Math k-means Walkthrough
      2. k-means Clustering In-Depth Walkthrough
      3. Alternative Distance Metric – Manhattan Distance
      4. Deeper Dimensions
      5. Exercise 2: Calculating Euclidean Distance in Python
      6. Exercise 3: Forming Clusters with the Notion of Distance
      7. Exercise 4: Implementing k-means from Scratch
      8. Exercise 5: Implementing k-means with Optimization
      9. Clustering Performance: Silhouette Score
      10. Exercise 6: Calculating the Silhouette Score
      11. Activity 1: Implementing k-means Clustering
    5. Summary
  3. Chapter 2: Hierarchical Clustering
    1. Introduction
    2. Clustering Refresher
      1. k-means Refresher
    3. The Organization of Hierarchy
    4. Introduction to Hierarchical Clustering
      1. Steps to Perform Hierarchical Clustering
      2. An Example Walk-Through of Hierarchical Clustering
      3. Exercise 7: Building a Hierarchy
    5. Linkage
      1. Activity 2: Applying Linkage Criteria
    6. Agglomerative versus Divisive Clustering
      1. Exercise 8: Implementing Agglomerative Clustering with scikit-learn
      2. Activity 3: Comparing k-means with Hierarchical Clustering
    7. k-means versus Hierarchical Clustering
    8. Summary
  4. Chapter 3: Neighborhood Approaches and DBSCAN
    1. Introduction
      1. Clusters as Neighborhoods
    2. Introduction to DBSCAN
      1. DBSCAN In-Depth
      2. Walkthrough of the DBSCAN Algorithm
      3. Exercise 9: Evaluating the Impact of Neighborhood Radius Size
      4. DBSCAN Attributes – Neighborhood Radius
      5. Activity 4: Implement DBSCAN from Scratch
      6. DBSCAN Attributes – Minimum Points
      7. Exercise 10: Evaluating the Impact of Minimum Points Threshold
      8. Activity 5: Comparing DBSCAN with k-means and Hierarchical Clustering
    3. DBSCAN Versus k-means and Hierarchical Clustering
    4. Summary
  5. Chapter 4: Dimension Reduction and PCA
    1. Introduction
      1. What Is Dimensionality Reduction?
      2. Applications of Dimensionality Reduction
      3. The Curse of Dimensionality
    2. Overview of Dimensionality Reduction Techniques
      1. Dimensionality Reduction and Unsupervised Learning
    3. PCA
      1. Mean
      2. Standard Deviation
      3. Covariance
      4. Covariance Matrix
      5. Exercise 11: Understanding the Foundational Concepts of Statistics
      6. Eigenvalues and Eigenvectors
      7. Exercise 12: Computing Eigenvalues and Eigenvectors
      8. The Process of PCA
      9. Exercise 13: Manually Executing PCA
      10. Exercise 14: Scikit-Learn PCA
      11. Activity 6: Manual PCA versus scikit-learn
      12. Restoring the Compressed Dataset
      13. Exercise 15: Visualizing Variance Reduction with Manual PCA
      14. Exercise 16: Visualizing Variance Reduction with
      15. Exercise 17: Plotting 3D Plots in Matplotlib
      16. Activity 7: PCA Using the Expanded Iris Dataset
    4. Summary
  6. Chapter 5: Autoencoders
    1. Introduction
    2. Fundamentals of Artificial Neural Networks
      1. The Neuron
      2. Sigmoid Function
      3. Rectified Linear Unit (ReLU)
      4. Exercise 18: Modeling the Neurons of an Artificial Neural Network
      5. Activity 8: Modeling Neurons with a ReLU Activation Function
      6. Neural Networks: Architecture Definition
      7. Exercise 19: Defining a Keras Model
      8. Neural Networks: Training
      9. Exercise 20: Training a Keras Neural Network Model
      10. Activity 9: MNIST Neural Network
    3. Autoencoders
      1. Exercise 21: Simple Autoencoder
      2. Activity 10: Simple MNIST Autoencoder
      3. Exercise 22: Multi-Layer Autoencoder
      4. Convolutional Neural Networks
      5. Exercise 23: Convolutional Autoencoder
      6. Activity 11: MNIST Convolutional Autoencoder
    4. Summary
  7. Chapter 6: t-Distributed Stochastic Neighbor Embedding (t-SNE)
    1. Introduction
    2. Stochastic Neighbor Embedding (SNE)
    3. t-Distributed SNE
      1. Exercise 24: t-SNE MNIST
      2. Activity 12: Wine t-SNE
    4. Interpreting t-SNE Plots
      1. Perplexity
      2. Exercise 25: t-SNE MNIST and Perplexity
      3. Activity 13: t-SNE Wine and Perplexity
      4. Iterations
      5. Exercise 26: t-SNE MNIST and Iterations
      6. Activity 14: t-SNE Wine and Iterations
      7. Final Thoughts on Visualizations
    5. Summary
  8. Chapter 7: Topic Modeling
    1. Introduction
      1. Topic Models
      2. Exercise 27: Setting Up the Environment
      3. A High-Level Overview of Topic Models
      4. Business Applications
      5. Exercise 28: Data Loading
    2. Cleaning Text Data
      1. Data Cleaning Techniques
      2. Exercise 29: Cleaning Data Step by Step
      3. Exercise 30: Complete Data Cleaning
      4. Activity 15: Loading and Cleaning Twitter Data
    3. Latent Dirichlet Allocation
      1. Variational Inference
      2. Bag of Words
      3. Exercise 31: Creating a Bag-of-Words Model Using the Count Vectorizer
      4. Perplexity
      5. Exercise 32: Selecting the Number of Topics
      6. Exercise 33: Running Latent Dirichlet Allocation
      7. Exercise 34: Visualize LDA
      8. Exercise 35: Trying Four Topics
      9. Activity 16: Latent Dirichlet Allocation and Health Tweets
      10. Bag-of-Words Follow-Up
      11. Exercise 36: Creating a Bag-of-Words Using TF-IDF
    4. Non-Negative Matrix Factorization
      1. Frobenius Norm
      2. Multiplicative Update
      3. Exercise 37: Non-negative Matrix Factorization
      4. Exercise 38: Visualizing NMF
      5. Activity 17: Non-Negative Matrix Factorization
    5. Summary
  9. Chapter 8: Market Basket Analysis
    1. Introduction
    2. Market Basket Analysis
      1. Use Cases
      2. Important Probabilistic Metrics
      3. Exercise 39: Creating Sample Transaction Data
      4. Support
      5. Confidence
      6. Lift and Leverage
      7. Conviction
      8. Exercise 40: Computing Metrics
    3. Characteristics of Transaction Data
      1. Exercise 41: Loading Data
      2. Data Cleaning and Formatting
      3. Exercise 42: Data Cleaning and Formatting
      4. Data Encoding
      5. Exercise 43: Data Encoding
      6. Activity 18: Loading and Preparing Full Online Retail Data
    4. Apriori Algorithm
      1. Computational Fixes
      2. Exercise 44: Executing the Apriori algorithm
      3. Activity 19: Apriori on the Complete Online Retail Dataset
    5. Association Rules
      1. Exercise 45: Deriving Association Rules
      2. Activity 20: Finding the Association Rules on the Complete Online Retail Dataset
    6. Summary
  10. Chapter 9: Hotspot Analysis
    1. Introduction
      1. Spatial Statistics
      2. Probability Density Functions
      3. Using Hotspot Analysis in Business
    2. Kernel Density Estimation
      1. The Bandwidth Value
      2. Exercise 46: The Effect of the Bandwidth Value
      3. Selecting the Optimal Bandwidth
      4. Exercise 47: Selecting the Optimal Bandwidth Using Grid Search
      5. Kernel Functions
      6. Exercise 48: The Effect of the Kernel Function
      7. Kernel Density Estimation Derivation
      8. Exercise 49: Simulating the Derivation of Kernel Density Estimation
      9. Activity 21: Estimating Density in One Dimension
    3. Hotspot Analysis
      1. Exercise 50: Loading Data and Modeling with Seaborn
      2. Exercise 51: Working with Basemaps
      3. Activity 22: Analyzing Crime in London
    4. Summary
  11. Appendix
    1. Chapter 1: Introduction to Clustering
      1. Activity 1: Implementing k-means Clustering
    2. Chapter 2: Hierarchical Clustering
      1. Activity 3: Comparing k-means with Hierarchical Clustering
    3. Chapter 3: Neighborhood Approaches and DBSCAN
      1. Activity 4: Implement DBSCAN from Scratch
      2. Activity 5: Comparing DBSCAN with k-means and Hierarchical Clustering
    4. Chapter 4: Dimension Reduction and PCA
      1. Activity 6: Manual PCA versus scikit-learn
      2. Activity 7: PCA Using the Expanded Iris Dataset
    5. Chapter 5: Autoencoders
      1. Activity 8: Modeling Neurons with a ReLU Activation Function
      2. Activity 9: MNIST Neural Network
      3. Activity 10: Simple MNIST Autoencoder
      4. Activity 11: MNIST Convolutional Autoencoder
    6. Chapter 6: t-Distributed Stochastic Neighbor Embedding (t-SNE)
      1. Activity 12: Wine t-SNE
      2. Activity 13: t-SNE Wine and Perplexity
      3. Activity 14: t-SNE Wine and Iterations
    7. Chapter 7: Topic Modeling
      1. Activity 15: Loading and Cleaning Twitter Data
      2. Activity 16: Latent Dirichlet Allocation and Health Tweets
      3. Activity 17: Non-Negative Matrix Factorization
    8. Chapter 8: Market Basket Analysis
      1. Activity 18: Loading and Preparing Full Online Retail Data
      2. Activity 19: Apriori on the Complete Online Retail Dataset
      3. Activity 20: Finding the Association Rules on the Complete Online Retail Dataset
    9. Chapter 9: Hotspot Analysis
      1. Activity 21: Estimating Density in One Dimension
      2. Activity 22: Analyzing Crime in London