Applied Unsupervised Learning with R

Book Description

Design clever algorithms that discover hidden patterns and draw responses from unstructured, unlabeled data.

Key Features

  • Build state-of-the-art algorithms that can solve your business problems
  • Learn how to find hidden patterns in your data
  • Revise key concepts with hands-on exercises using real-world datasets

Book Description

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions.

This book begins with clustering, the most important and commonly used method for unsupervised learning, and explains the three main clustering algorithms: k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models.

By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.

What you will learn

  • Implement clustering methods such as k-means, agglomerative, and divisive
  • Write code in R to analyze market segmentation and consumer behavior
  • Estimate distribution and probabilities of different outcomes
  • Implement dimension reduction using principal component analysis
  • Apply anomaly detection methods to identify fraud
  • Design algorithms with R and learn how to edit or improve code

Who this book is for

Applied Unsupervised Learning with R is designed for business professionals who want to learn about methods to understand their data better, and developers who have an interest in unsupervised learning. Although the book is for beginners, some basic familiarity with R will be beneficial: you should know how to open the R console, how to read data, and how to create a loop. To follow the concepts in this book easily, you should also know basic mathematical concepts, including exponents, square roots, means, and medians.

Table of Contents

  1. Preface
    1. About the Book
      1. About the Authors
      2. Elevator Pitch
      3. Key Features
      4. Description
      5. Learning Objectives
      6. Audience
      7. Approach
      8. Hardware Requirements
      9. Software Requirements
      10. Conventions
      11. Installation and Setup
      12. Installing R on Windows
      13. Installing R on macOS X
      14. Installing R on Linux
  2. Chapter 1
  3. Introduction to Clustering Methods
    1. Introduction
    2. Introduction to Clustering
      1. Uses of Clustering
    3. Introduction to the Iris Dataset
      1. Exercise 1: Exploring the Iris Dataset
      2. Types of Clustering
    4. Introduction to k-means Clustering
      1. Euclidean Distance
      2. Manhattan Distance
      3. Cosine Distance
      4. The Hamming Distance
      5. k-means Clustering Algorithm
      6. Steps to Implement k-means Clustering
      7. Exercise 2: Implementing k-means Clustering on the Iris Dataset
      8. Activity 1: k-means Clustering with Three Clusters
    5. Introduction to k-means Clustering with Built-In Functions
      1. k-means Clustering with Three Clusters
      2. Exercise 3: k-means Clustering with R Libraries
    6. Introduction to Market Segmentation
      1. Exercise 4: Exploring the Wholesale Customer Dataset
      2. Activity 2: Customer Segmentation with k-means
    7. Introduction to k-medoids Clustering
      1. The k-medoids Clustering Algorithm
      2. k-medoids Clustering Code
      3. Exercise 5: Implementing k-medoid Clustering
      4. k-means Clustering versus k-medoids Clustering
      5. Activity 3: Performing Customer Segmentation with k-medoids Clustering
      6. Deciding the Optimal Number of Clusters
      7. Types of Clustering Metrics
      8. Silhouette Score
      9. Exercise 6: Calculating the Silhouette Score
      10. Exercise 7: Identifying the Optimum Number of Clusters
      11. WSS/Elbow Method
      12. Exercise 8: Using WSS to Determine the Number of Clusters
      13. The Gap Statistic
      14. Exercise 9: Calculating the Ideal Number of Clusters with the Gap Statistic
      15. Activity 4: Finding the Ideal Number of Market Segments
    8. Summary
  4. Chapter 2
  5. Advanced Clustering Methods
    1. Introduction
    2. Introduction to k-modes Clustering
      1. Steps for k-modes Clustering
      2. Exercise 10: Implementing k-modes Clustering
      3. Activity 5: Implementing k-modes Clustering on the Mushroom Dataset
    3. Introduction to Density-Based Clustering (DBSCAN)
      1. Steps for DBSCAN
      2. Exercise 11: Implementing DBSCAN
      3. Uses of DBSCAN
      4. Activity 6: Implementing DBSCAN and Visualizing the Results
      5. Introduction to Hierarchical Clustering
      6. Types of Similarity Metrics
      7. Steps to Perform Agglomerative Hierarchical Clustering
      8. Exercise 12: Agglomerative Clustering with Different Similarity Measures
      9. Divisive Clustering
      10. Steps to Perform Divisive Clustering
      11. Exercise 13: Performing DIANA Clustering
      12. Activity 7: Performing Hierarchical Cluster Analysis on the Seeds Dataset
    4. Summary
  6. Chapter 3
  7. Probability Distributions
    1. Introduction
    2. Basic Terminology of Probability Distributions
      1. Uniform Distribution
      2. Exercise 14: Generating and Plotting Uniform Samples in R
      3. Normal Distribution
      4. Exercise 15: Generating and Plotting a Normal Distribution in R
      5. Skew and Kurtosis
      6. Log-Normal Distributions
      7. Exercise 16: Generating a Log-Normal Distribution from a Normal Distribution
      8. The Binomial Distribution
      9. Exercise 17: Generating a Binomial Distribution
      10. The Poisson Distribution
      11. The Pareto Distribution
    3. Introduction to Kernel Density Estimation
      1. KDE Algorithm
      2. Exercise 18: Visualizing and Understanding KDE
      3. Exercise 19: Studying the Effect of Changing Kernels on a Distribution
      4. Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset
    4. Introduction to the Kolmogorov-Smirnov Test
      1. The Kolmogorov-Smirnov Test Algorithm
      2. Exercise 20: Performing the Kolmogorov-Smirnov Test on Two Samples
      3. Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution
    5. Summary
  8. Chapter 4
  9. Dimension Reduction
    1. Introduction
      1. The Idea of Dimension Reduction
      2. Exercise 21: Examining a Dataset that Contains the Chemical Attributes of Different Wines
      3. Importance of Dimension Reduction
    2. Market Basket Analysis
      1. Exercise 22: Data Preparation for the Apriori Algorithm
      2. Exercise 23: Passing through the Data to Find the Most Common Baskets
      3. Exercise 24: More Passes through the Data
      4. Exercise 25: Generating Associative Rules as the Final Step of the Apriori Algorithm
      5. Principal Component Analysis
      6. Linear Algebra Refresher
      7. Matrices
      8. Variance
      9. Covariance
      10. Exercise 26: Examining Variance and Covariance on the Wine Dataset
      11. Eigenvectors and Eigenvalues
      12. The Idea of PCA
      13. Exercise 27: Performing PCA
      14. Exercise 28: Performing Dimension Reduction with PCA
      15. Activity 10: Performing PCA and Market Basket Analysis on a New Dataset
    3. Summary
  10. Chapter 5
  11. Data Comparison Methods
    1. Introduction
      1. Hash Functions
      2. Exercise 29: Creating and Using a Hash Function
      3. Exercise 30: Verifying Our Hash Function
    2. Analytic Signatures
      1. Exercise 31: Perform the Data Preparation for Creating an Analytic Signature for an Image
      2. Exercise 32: Creating a Brightness Comparison Function
      3. Exercise 33: Creating a Function to Compare Image Sections to All of the Neighboring Sections
      4. Exercise 34: Creating a Function that Generates an Analytic Signature for an Image
      5. Activity 11: Creating an Image Signature for a Photograph of a Person
    3. Comparison of Signatures
      1. Activity 12: Creating an Image Signature for the Watermarked Image
      2. Applying Other Unsupervised Learning Methods to Analytic Signatures
    4. Latent Variable Models – Factor Analysis
      1. Exercise 35: Preparing for Factor Analysis
      2. Linear Algebra behind Factor Analysis
      3. Exercise 36: More Exploration with Factor Analysis
      4. Activity 13: Performing Factor Analysis
    5. Summary
  12. Chapter 6
  13. Anomaly Detection
    1. Introduction
    2. Univariate Outlier Detection
      1. Exercise 37: Performing an Exploratory Visual Check for Outliers Using R's boxplot Function
      2. Exercise 38: Transforming a Fat-Tailed Dataset to Improve Outlier Classification
      3. Exercise 39: Finding Outliers without Using R's Built-In boxplot Function
      4. Exercise 40: Detecting Outliers Using a Parametric Method
      5. Multivariate Outlier Detection
      6. Exercise 41: Calculating Mahalanobis Distance
      7. Detecting Anomalies in Clusters
      8. Other Methods for Multivariate Outlier Detection
      9. Exercise 42: Classifying Outliers based on Comparisons of Mahalanobis Distances
      10. Detecting Outliers in Seasonal Data
      11. Exercise 43: Performing Seasonality Modeling
      12. Exercise 44: Finding Anomalies in Seasonal Data Using a Parametric Method
      13. Contextual and Collective Anomalies
      14. Exercise 45: Detecting Contextual Anomalies
      15. Exercise 46: Detecting Collective Anomalies
    3. Kernel Density
      1. Exercise 47: Finding Anomalies Using Kernel Density Estimation
      2. Continuing in Your Studies of Anomaly Detection
      3. Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method
      4. Activity 15: Using Mahalanobis Distance to Find Anomalies
    4. Summary
  14. Appendix
    1. Chapter 1: Introduction to Clustering Methods
      1. Activity 1: k-means Clustering with Three Clusters
      2. Activity 2: Customer Segmentation with k-means
      3. Activity 3: Performing Customer Segmentation with k-medoids Clustering
      4. Activity 4: Finding the Ideal Number of Market Segments
    2. Chapter 2: Advanced Clustering Methods
      1. Activity 5: Implementing k-modes Clustering on the Mushroom Dataset
      2. Activity 6: Implementing DBSCAN and Visualizing the Results
      3. Activity 7: Performing a Hierarchical Cluster Analysis on the Seeds Dataset
    3. Chapter 3: Probability Distributions
      1. Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset
      2. Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution
    4. Chapter 4: Dimension Reduction
      1. Activity 10: Performing PCA and Market Basket Analysis on a New Dataset
    5. Chapter 5: Data Comparison Methods
      1. Activity 11: Create an Image Signature for a Photograph of a Person
      2. Activity 12: Create an Image Signature for the Watermarked Image
      3. Activity 13: Performing Factor Analysis
    6. Chapter 6: Anomaly Detection
      1. Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method
      2. Activity 15: Using Mahalanobis Distance to Find Anomalies