O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Data Clustering

Book Description

Research on the problem of clustering tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. Addressing this problem in a unified way, Data Clustering: Algorithms and Applications provides complete coverage of the entire area of clustering, from basic methods to more refined and complex data clustering approaches. It pays special attention to recent issues in graphs, social networks, and other domains.

The book focuses on three primary aspects of data clustering:

  • Methods, describing key techniques commonly used for clustering, such as feature selection, agglomerative clustering, partitional clustering, density-based clustering, probabilistic clustering, grid-based clustering, spectral clustering, and nonnegative matrix factorization
  • Domains, covering methods used for different domains of data, such as categorical data, text data, multimedia data, graph data, biological data, stream data, uncertain data, time series clustering, high-dimensional clustering, and big data
  • Variations and Insights, discussing important variations of the clustering process, such as semisupervised clustering, interactive clustering, multiview clustering, cluster ensembles, and cluster validation

In this book, top researchers from around the world explore the characteristics of clustering problems in a variety of application areas. They also explain how to glean detailed insight from the clustering process—including how to verify the quality of the underlying clusters—through supervision, human intervention, or the automated generation of alternative clusters.

Table of Contents

  1. Cover
  2. Preliminaries
  3. Preface
  4. Editor Biographies
  5. Contributors
  6. Chapter 1 An Introduction to Cluster Analysis
    1. 1.1 Introduction
    2. 1.2 Common Techniques Used in Cluster Analysis
      1. 1.2.1 Feature Selection Methods
      2. 1.2.2 Probabilistic and Generative Models
      3. 1.2.3 Distance-Based Algorithms
      4. 1.2.4 Density- and Grid-Based Methods
      5. 1.2.5 Leveraging Dimensionality Reduction Methods
        1. 1.2.5.1 Generative Models for Dimensionality Reduction
        2. 1.2.5.2 Matrix Factorization and Co-Clustering
        3. 1.2.5.3 Spectral Methods
      6. 1.2.6 The High Dimensional Scenario
      7. 1.2.7 Scalable Techniques for Cluster Analysis
        1. 1.2.7.1 I/O Issues in Database Management
        2. 1.2.7.2 Streaming Algorithms
        3. 1.2.7.3 The Big Data Framework
    3. 1.3 Data Types Studied in Cluster Analysis
      1. 1.3.1 Clustering Categorical Data
      2. 1.3.2 Clustering Text Data
      3. 1.3.3 Clustering Multimedia Data
      4. 1.3.4 Clustering Time-Series Data
      5. 1.3.5 Clustering Discrete Sequences
      6. 1.3.6 Clustering Network Data
      7. 1.3.7 Clustering Uncertain Data
    4. 1.4 Insights Gained from Different Variations of Cluster Analysis
      1. 1.4.1 Visual Insights
      2. 1.4.2 Supervised Insights
      3. 1.4.3 Multiview and Ensemble-Based Insights
      4. 1.4.4 Validation-Based Insights
    5. 1.5 Discussion and Conclusions
    6. Bibliography
  7. Chapter 2 Feature Selection for Clustering: A Review
    1. 2.1 Introduction
      1. 2.1.1 Data Clustering
      2. 2.1.2 Feature Selection
      3. 2.1.3 Feature Selection for Clustering
        1. 2.1.3.1 Filter Model
        2. 2.1.3.2 Wrapper Model
        3. 2.1.3.3 Hybrid Model
    2. 2.2 Feature Selection for Clustering
      1. 2.2.1 Algorithms for Generic Data
        1. 2.2.1.1 Spectral Feature Selection (SPEC)
        2. 2.2.1.2 Laplacian Score (LS)
        3. 2.2.1.3 Feature Selection for Sparse Clustering
        4. 2.2.1.4 Localized Feature Selection Based on Scatter Separability (LFSBSS)
        5. 2.2.1.5 Multicluster Feature Selection (MCFS)
        6. 2.2.1.6 Feature Weighting k-Means
      2. 2.2.2 Algorithms for Text Data
        1. 2.2.2.1 Term Frequency (TF)
        2. 2.2.2.2 Inverse Document Frequency (IDF)
        3. 2.2.2.3 Term Frequency-Inverse Document Frequency (TF-IDF)
        4. 2.2.2.4 Chi Square Statistic
        5. 2.2.2.5 Frequent Term-Based Text Clustering
        6. 2.2.2.6 Frequent Term Sequence
      3. 2.2.3 Algorithms for Streaming Data
        1. 2.2.3.1 Text Stream Clustering Based on Adaptive Feature Selection (TSC-AFS)
        2. 2.2.3.2 High-Dimensional Projected Stream Clustering (HPStream)
      4. 2.2.4 Algorithms for Linked Data
        1. 2.2.4.1 Challenges and Opportunities
        2. 2.2.4.2 LUFS: An Unsupervised Feature Selection Framework for Linked Data
        3. 2.2.4.3 Conclusion and Future Work for Linked Data
    3. 2.3 Discussions and Challenges
      1. 2.3.1 The Chicken or the Egg Dilemma
      2. 2.3.2 Model Selection: K and l
      3. 2.3.3 Scalability
      4. 2.3.4 Stability
    4. Bibliography
      1. Figure 2.1
      2. Figure 2.2
      3. Figure 2.3
      4. Figure 2.4
      1. Table 2.1
  8. Chapter 3 Probabilistic Models for Clustering
    1. 3.1 Introduction
    2. 3.2 Mixture Models
      1. 3.2.1 Overview
      2. 3.2.2 Gaussian Mixture Model
      3. 3.2.3 Bernoulli Mixture Model
      4. 3.2.4 Model Selection Criteria
    3. 3.3 EM Algorithm and Its Variations
      1. 3.3.1 The General EM Algorithm
      2. 3.3.2 Mixture Models Revisited
      3. 3.3.3 Limitations of the EM Algorithm
      4. 3.3.4 Applications of the EM Algorithm
    4. 3.4 Probabilistic Topic Models
      1. 3.4.1 Probabilistic Latent Semantic Analysis
      2. 3.4.2 Latent Dirichlet Allocation
      3. 3.4.3 Variations and Extensions
    5. 3.5 Conclusions and Summary
    6. Bibliography
      1. Figure 3.1
      2. Figure 3.2
      3. Figure 3.3
      4. Figure 3.4
      5. Figure 3.5
      6. Figure 3.6
      7. Figure 3.7
  9. Chapter 4 A Survey of Partitional and Hierarchical Clustering Algorithms
    1. 4.1 Introduction
    2. 4.2 Partitional Clustering Algorithms
      1. 4.2.1 K-Means Clustering
      2. 4.2.2 Minimization of Sum of Squared Errors
      3. 4.2.3 Factors Affecting K-Means
        1. 4.2.3.1 Popular Initialization Methods
        2. 4.2.3.2 Estimating the Number of Clusters
      4. 4.2.4 Variations of K-Means
        1. 4.2.4.1 K-Medoids Clustering
        2. 4.2.4.2 K-Medians Clustering
        3. 4.2.4.3 K-Modes Clustering
        4. 4.2.4.4 Fuzzy K-Means Clustering
        5. 4.2.4.5 X-Means Clustering
        6. 4.2.4.6 Intelligent K-Means Clustering
        7. 4.2.4.7 Bisecting K-Means Clustering
        8. 4.2.4.8 Kernel K-Means Clustering
        9. 4.2.4.9 Mean Shift Clustering
        10. 4.2.4.10 Weighted K-Means Clustering
        11. 4.2.4.11 Genetic K-Means Clustering
      5. 4.2.5 Making K-Means Faster
    3. 4.3 Hierarchical Clustering Algorithms
      1. 4.3.1 Agglomerative Clustering
        1. 4.3.1.1 Single and Complete Link
        2. 4.3.1.2 Group Averaged and Centroid Agglomerative Clustering
        3. 4.3.1.3 Ward’s Criterion
        4. 4.3.1.4 Agglomerative Hierarchical Clustering Algorithm
        5. 4.3.1.5 Lance–Williams Dissimilarity Update Formula
      2. 4.3.2 Divisive Clustering
        1. 4.3.2.1 Issues in Divisive Clustering
        2. 4.3.2.2 Divisive Hierarchical Clustering Algorithm
        3. 4.3.2.3 Minimum Spanning Tree-Based Clustering
      3. 4.3.3 Other Hierarchical Clustering Algorithms
    4. 4.4 Discussion and Summary
    5. Bibliography
      1. Figure 4.1
      2. Figure 4.2
      1. Table 4.1
  10. Chapter 5 Density-Based Clustering
    1. 5.1 Introduction
    2. 5.2 DBSCAN
    3. 5.3 DENCLUE
    4. 5.4 OPTICS
    5. 5.5 Other Algorithms
    6. 5.6 Subspace Clustering
    7. 5.7 Clustering Networks
    8. 5.8 Other Directions
    9. 5.9 Conclusion
    10. Bibliography
      1. Figure 5.1
      2. Figure 5.2
      3. Figure 5.3
      4. Figure 5.4
      5. Figure 5.5
      6. Figure 5.6
  11. Chapter 6 Grid-Based Clustering
    1. 6.1 Introduction
    2. 6.2 The Classical Algorithms
      1. 6.2.1 Earliest Approaches: GRIDCLUS and BANG
      2. 6.2.2 STING and STING+: The Statistical Information Grid Approach
      3. 6.2.3 WaveCluster: Wavelets in Grid-Based Clustering
    3. 6.3 Adaptive Grid-Based Algorithms
      1. 6.3.1 AMR: Adaptive Mesh Refinement Clustering
    4. 6.4 Axis-Shifting Grid-Based Algorithms
      1. 6.4.1 NSGC: New Shifting Grid Clustering Algorithm
      2. 6.4.2 ADCC: Adaptable Deflect and Conquer Clustering
      3. 6.4.3 ASGC: Axis-Shifted Grid-Clustering
      4. 6.4.4 GDILC: Grid-Based Density-IsoLine Clustering Algorithm
    5. 6.5 High-Dimensional Algorithms
      1. 6.5.1 CLIQUE: The Classical High-Dimensional Algorithm
      2. 6.5.2 Variants of CLIQUE
        1. 6.5.2.1 ENCLUS: Entropy-Based Approach
        2. 6.5.2.2 MAFIA: Adaptive Grids in High Dimensions
      3. 6.5.3 OptiGrid: Density-Based Optimal Grid Partitioning
      4. 6.5.4 Variants of the OptiGrid Approach
        1. 6.5.4.1 O-Cluster: A Scalable Approach
        2. 6.5.4.2 CBF: Cell-Based Filtering
    6. 6.6 Conclusions and Summary
    7. Bibliography
      1. Figure 6.1
      2. Figure 6.2
      3. Figure 6.3
      1. Table 6.1
      2. Table 6.2
      3. Table 6.3
      4. Table 6.4
  12. Chapter 7 Nonnegative Matrix Factorizations for Clustering: A Survey
    1. 7.1 Introduction
      1. 7.1.1 Background
      2. 7.1.2 NMF Formulations
    2. 7.2 NMF for Clustering: Theoretical Foundations
      1. 7.2.1 NMF and K-Means Clustering
      2. 7.2.2 NMF and Probabilistic Latent Semantic Indexing
      3. 7.2.3 NMF and Kernel K-Means and Spectral Clustering
      4. 7.2.4 NMF Boundedness Theorem
    3. 7.3 NMF Clustering Capabilities
      1. 7.3.1 Examples
      2. 7.3.2 Analysis
    4. 7.4 NMF Algorithms
      1. 7.4.1 Introduction
      2. 7.4.2 Algorithm Development
      3. 7.4.3 Practical Issues in NMF Algorithms
        1. 7.4.3.1 Initialization
        2. 7.4.3.2 Stopping Criteria
        3. 7.4.3.3 Objective Function vs. Clustering Performance
        4. 7.4.3.4 Scalability
    5. 7.5 NMF Related Factorizations
    6. 7.6 NMF for Clustering: Extensions
      1. 7.6.1 Co-Clustering
      2. 7.6.2 Semisupervised Clustering
      3. 7.6.3 Semisupervised Co-Clustering
      4. 7.6.4 Consensus Clustering
      5. 7.6.5 Graph Clustering
      6. 7.6.6 Other Clustering Extensions
    7. 7.7 Conclusions
    8. Acknowledgment
    9. Bibliography
      1. Figure 7.1
      2. Figure 7.2
      3. Figure 7.3
  13. Chapter 8 Spectral Clustering
    1. 8.1 Introduction
    2. 8.2 Similarity Graph
    3. 8.3 Unnormalized Spectral Clustering
      1. 8.3.1 Notation
      2. 8.3.2 Unnormalized Graph Laplacian
      3. 8.3.3 Spectrum Analysis
      4. 8.3.4 Unnormalized Spectral Clustering Algorithm
    4. 8.4 Normalized Spectral Clustering
      1. 8.4.1 Normalized Graph Laplacian
      2. 8.4.2 Spectrum Analysis
      3. 8.4.3 Normalized Spectral Clustering Algorithm
    5. 8.5 Graph Cut View
      1. 8.5.1 Ratio Cut Relaxation
      2. 8.5.2 Normalized Cut Relaxation
    6. 8.6 Random Walks View
    7. 8.7 Connection to Laplacian Eigenmap
    8. 8.8 Connection to Kernel k-Means and Nonnegative Matrix Factorization
    9. 8.9 Large Scale Spectral Clustering
    10. 8.10 Further Reading
    11. Bibliography
      1. Figure 8.1
      2. Figure 8.2
      3. Figure 8.3
  14. Chapter 9 Clustering High-Dimensional Data
    1. 9.1 Introduction
    2. 9.2 The “Curse of Dimensionality”
      1. 9.2.1 Different Aspects of the “Curse”
      2. 9.2.2 Consequences
    3. 9.3 Clustering Tasks in Subspaces of High-Dimensional Data
      1. 9.3.1 Categories of Subspaces
        1. 9.3.1.1 Axis-Parallel Subspaces
        2. 9.3.1.2 Arbitrarily Oriented Subspaces
        3. 9.3.1.3 Special Cases
      2. 9.3.2 Search Spaces for the Clustering Problem
    4. 9.4 Fundamental Algorithmic Ideas
      1. 9.4.1 Clustering in Axis-Parallel Subspaces
        1. 9.4.1.1 Cluster Model
        2. 9.4.1.2 Basic Techniques
        3. 9.4.1.3 Clustering Algorithms
      2. 9.4.2 Clustering in Arbitrarily Oriented Subspaces
        1. 9.4.2.1 Cluster Model
        2. 9.4.2.2 Basic Techniques and Example Algorithms
    5. 9.5 Open Questions and Current Research Directions
    6. 9.6 Conclusion
    7. Bibliography
  15. Chapter 10 A Survey of Stream Clustering Algorithms
    1. 10.1 Introduction
    2. 10.2 Methods Based on Partitioning Representatives
      1. 10.2.1 The STREAM Algorithm
      2. 10.2.2 CluStream: The Microclustering Framework
        1. 10.2.2.1 Microcluster Definition
        2. 10.2.2.2 Pyramidal Time Frame
        3. 10.2.2.3 Online Clustering with CluStream
    3. 10.3 Density-Based Stream Clustering
      1. 10.3.1 DenStream: Density-Based Microclustering
      2. 10.3.2 Grid-Based Streaming Algorithms
        1. 10.3.2.1 D-Stream Algorithm
        2. 10.3.2.2 Other Grid-Based Algorithms
    4. 10.4 Probabilistic Streaming Algorithms
    5. 10.5 Clustering High-Dimensional Streams
      1. 10.5.1 The HPSTREAM Method
      2. 10.5.2 Other High-Dimensional Streaming Algorithms
    6. 10.6 Clustering Discrete and Categorical Streams
      1. 10.6.1 Clustering Binary Data Streams with k-Means
      2. 10.6.2 The StreamCluCD Algorithm
      3. 10.6.3 Massive-Domain Clustering
    7. 10.7 Text Stream Clustering
    8. 10.8 Other Scenarios for Stream Clustering
      1. 10.8.1 Clustering Uncertain Data Streams
      2. 10.8.2 Clustering Graph Streams
      3. 10.8.3 Distributed Clustering of Data Streams
    9. 10.9 Discussion and Conclusions
    10. Bibliography
      1. Figure 10.1
      1. Table 10.1
  16. Chapter 11 Big Data Clustering
    1. 11.1 Introduction
    2. 11.2 One-Pass Clustering Algorithms
      1. 11.2.1 CLARANS: Fighting with Exponential Search Space
      2. 11.2.2 BIRCH: Fighting with Limited Memory
      3. 11.2.3 CURE: Fighting with the Irregular Clusters
    3. 11.3 Randomized Techniques for Clustering Algorithms
      1. 11.3.1 Locality-Preserving Projection
      2. 11.3.2 Global Projection
    4. 11.4 Parallel and Distributed Clustering Algorithms
      1. 11.4.1 General Framework
      2. 11.4.2 DBDC: Density-Based Clustering
      3. 11.4.3 ParMETIS: Graph Partitioning
      4. 11.4.4 PKMeans: K-Means with MapReduce
      5. 11.4.5 DisCo: Co-Clustering with MapReduce
      6. 11.4.6 BoW: Subspace Clustering with MapReduce
    5. 11.5 Conclusion
    6. Bibliography
      1. Figure 11.1
      2. Figure 11.2
      3. Figure 11.3
      4. Figure 11.4
      5. Figure 11.5
      6. Figure 11.6
      7. Figure 11.7
      8. Figure 11.8
      9. Figure 11.9
      10. Figure 11.10
  17. Chapter 12 Clustering Categorical Data
    1. 12.1 Introduction
    2. 12.2 Goals of Categorical Clustering
      1. 12.2.1 Clustering Road Map
    3. 12.3 Similarity Measures for Categorical Data
      1. 12.3.1 The Hamming Distance in Categorical and Binary Data
      2. 12.3.2 Probabilistic Measures
      3. 12.3.3 Information-Theoretic Measures
      4. 12.3.4 Context-Based Similarity Measures
    4. 12.4 Descriptions of Algorithms
      1. 12.4.1 Partition-Based Clustering
        1. 12.4.1.1 k-Modes
        2. 12.4.1.2 k-Prototypes (Mixed Categorical and Numerical)
        3. 12.4.1.3 Fuzzy k-Modes
        4. 12.4.1.4 Squeezer
        5. 12.4.1.5 COOLCAT
      2. 12.4.2 Hierarchical Clustering
        1. 12.4.2.1 ROCK
        2. 12.4.2.2 COBWEB
        3. 12.4.2.3 LIMBO
      3. 12.4.3 Density-Based Clustering
        1. 12.4.3.1 Projected (Subspace) Clustering
        2. 12.4.3.2 CACTUS
        3. 12.4.3.3 CLICKS
        4. 12.4.3.4 STIRR
        5. 12.4.3.5 CLOPE
        6. 12.4.3.6 HIERDENC: Hierarchical Density-Based Clustering
        7. 12.4.3.7 MULIC: Multiple Layer Incremental Clustering
      4. 12.4.4 Model-Based Clustering
        1. 12.4.4.1 BILCOM Empirical Bayesian (Mixed Categorical and Numerical)
        2. 12.4.4.2 AutoClass (Mixed Categorical and Numerical)
        3. 12.4.4.3 SVM Clustering (Mixed Categorical and Numerical)
    5. 12.5 Conclusion
    6. Bibliography
      1. Figure 12.1
      2. Figure 12.2
      3. Figure 12.3
      4. Figure 12.4
      5. Figure 12.5
      6. Figure 12.6
      7. Figure 12.7
      8. Figure 12.8
      9. Figure 12.9
      1. Table 12.1
      2. Table 12.2
  18. Chapter 13 Document Clustering: The Next Frontier
    1. 13.1 Introduction
    2. 13.2 Modeling a Document
      1. 13.2.1 Preliminaries
      2. 13.2.2 The Vector Space Model
      3. 13.2.3 Alternate Document Models
      4. 13.2.4 Dimensionality Reduction for Text
      5. 13.2.5 Characterizing Extremes
    3. 13.3 General Purpose Document Clustering
      1. 13.3.1 Similarity/Dissimilarity-Based Algorithms
      2. 13.3.2 Density-Based Algorithms
      3. 13.3.3 Adjacency-Based Algorithms
      4. 13.3.4 Generative Algorithms
    4. 13.4 Clustering Long Documents
      1. 13.4.1 Document Segmentation
      2. 13.4.2 Clustering Segmented Documents
      3. 13.4.3 Simultaneous Segment Identification and Clustering
    5. 13.5 Clustering Short Documents
      1. 13.5.1 General Methods for Short Document Clustering
      2. 13.5.2 Clustering with Knowledge Infusion
      3. 13.5.3 Clustering Web Snippets
      4. 13.5.4 Clustering Microblogs
    6. 13.6 Conclusion
    7. Bibliography
      1. Figure 13.1
      2. Figure 13.2
      3. Figure 13.3
      4. Figure 13.4
      5. Figure 13.5
      6. Figure 13.6
      7. Figure 13.7
      1. Table 13.1
      2. Table 13.2
  19. Chapter 14 Clustering Multimedia Data
    1. 14.1 Introduction
    2. 14.2 Clustering with Image Data
      1. 14.2.1 Visual Words Learning
      2. 14.2.2 Face Clustering and Annotation
      3. 14.2.3 Photo Album Event Recognition
      4. 14.2.4 Image Segmentation
      5. 14.2.5 Large-Scale Image Classification
    3. 14.3 Clustering with Video and Audio Data
      1. 14.3.1 Video Summarization
      2. 14.3.2 Video Event Detection
      3. 14.3.3 Video Story Clustering
      4. 14.3.4 Music Summarization
    4. 14.4 Clustering with Multimodal Data
    5. 14.5 Summary and Future Directions
    6. Bibliography
  20. Chapter 15 Time-Series Data Clustering
    1. 15.1 Introduction
    2. 15.2 The Diverse Formulations for Time-Series Clustering
    3. 15.3 Online Correlation-Based Clustering
      1. 15.3.1 Selective Muscles and Related Methods
      2. 15.3.2 Sensor Selection Algorithms for Correlation Clustering
    4. 15.4 Similarity and Distance Measures
      1. 15.4.1 Univariate Distance Measures
        1. 15.4.1.1 Lp Distance
        2. 15.4.1.2 Dynamic Time Warping Distance
        3. 15.4.1.3 EDIT Distance
        4. 15.4.1.4 Longest Common Subsequence
      2. 15.4.2 Multivariate Distance Measures
        1. 15.4.2.1 Multidimensional Lp Distance
        2. 15.4.2.2 Multidimensional DTW
        3. 15.4.2.3 Multidimensional LCSS
        4. 15.4.2.4 Multidimensional Edit Distance
        5. 15.4.2.5 Multidimensional Subsequence Matching
    5. 15.5 Shape-Based Time-Series Clustering Techniques
      1. 15.5.1 k-Means Clustering
      2. 15.5.2 Hierarchical Clustering
      3. 15.5.3 Density-Based Clustering
      4. 15.5.4 Trajectory Clustering
    6. 15.6 Time-Series Clustering Applications
    7. 15.7 Conclusions
    8. Bibliography
      1. Figure 15.1
      2. Figure 15.2
      3. Figure 15.3
      4. Figure 15.4
      5. Figure 15.5
  21. Chapter 16 Clustering Biological Data
    1. 16.1 Introduction
    2. 16.2 Clustering Microarray Data
      1. 16.2.1 Proximity Measures
      2. 16.2.2 Categorization of Algorithms
      3. 16.2.3 Standard Clustering Algorithms
        1. 16.2.3.1 Hierarchical Clustering
        2. 16.2.3.2 Probabilistic Clustering
        3. 16.2.3.3 Graph-Theoretic Clustering
        4. 16.2.3.4 Self-Organizing Maps
        5. 16.2.3.5 Other Clustering Methods
      4. 16.2.4 Biclustering
        1. 16.2.4.1 Types and Structures of Biclusters
        2. 16.2.4.2 Biclustering Algorithms
        3. 16.2.4.3 Recent Developments
      5. 16.2.5 Triclustering
      6. 16.2.6 Time-Series Gene Expression Data Clustering
      7. 16.2.7 Cluster Validation
    3. 16.3 Clustering Biological Networks
      1. 16.3.1 Characteristics of PPI Network Data
      2. 16.3.2 Network Clustering Algorithms
        1. 16.3.2.1 Molecular Complex Detection
        2. 16.3.2.2 Markov Clustering
        3. 16.3.2.3 Neighborhood Search Methods
        4. 16.3.2.4 Clique Percolation Method
        5. 16.3.2.5 Ensemble Clustering
        6. 16.3.2.6 Other Clustering Methods
      3. 16.3.3 Cluster Validation and Challenges
    4. 16.4 Biological Sequence Clustering
      1. 16.4.1 Sequence Similarity Metrics
        1. 16.4.1.1 Alignment-Based Similarity
        2. 16.4.1.2 Keyword-Based Similarity
        3. 16.4.1.3 Kernel-Based Similarity
        4. 16.4.1.4 Model-Based Similarity
      2. 16.4.2 Sequence Clustering Algorithms
        1. 16.4.2.1 Subsequence-Based Clustering
        2. 16.4.2.2 Graph-Based Clustering
        3. 16.4.2.3 Probabilistic Models
        4. 16.4.2.4 Suffix Tree and Suffix Array-Based Method
    5. 16.5 Software Packages
    6. 16.6 Discussion and Summary
    7. Bibliography
      1. Figure 16.1
      2. Figure 16.2
      3. Figure 16.3
      4. Figure 16.4
  22. Chapter 17 Network Clustering
    1. 17.1 Introduction
    2. 17.2 Background and Nomenclature
    3. 17.3 Problem Definition
    4. 17.4 Common Evaluation Criteria
    5. 17.5 Partitioning with Geometric Information
      1. 17.5.1 Coordinate Bisection
      2. 17.5.2 Inertial Bisection
      3. 17.5.3 Geometric Partitioning
    6. 17.6 Graph Growing and Greedy Algorithms
      1. 17.6.1 Kernighan-Lin Algorithm
    7. 17.7 Agglomerative and Divisive Clustering
    8. 17.8 Spectral Clustering
      1. 17.8.1 Similarity Graphs
      2. 17.8.2 Types of Similarity Graphs
      3. 17.8.3 Graph Laplacians
      4. 17.8.3.1 Unnormalized Graph Laplacian
      5. 17.8.3.2 Normalized Graph Laplacians
      6. 17.8.4 Spectral Clustering Algorithms
    9. 17.9 Markov Clustering
      1. 17.9.1 Regularized MCL (RMCL): Improvement over MCL
    10. 17.10 Multilevel Partitioning
    11. 17.11 Local Partitioning Algorithms
    12. 17.12 Hypergraph Partitioning
    13. 17.13 Emerging Methods for Partitioning Special Graphs
      1. 17.13.1 Bipartite Graphs
      2. 17.13.2 Dynamic Graphs
      3. 17.13.3 Heterogeneous Networks
      4. 17.13.4 Directed Networks
      5. 17.13.5 Combining Content and Relationship Information
      6. 17.13.6 Networks with Overlapping Communities
      7. 17.13.7 Probabilistic Methods
    14. 17.14 Conclusion
    15. Acknowledgments
    16. Bibliography
      1. Figure 17.1
      1. Table 17.1
  23. Chapter 18 A Survey of Uncertain Data Clustering Algorithms
    1. 18.1 Introduction
    2. 18.2 Mixture Model Clustering of Uncertain Data
    3. 18.3 Density-Based Clustering Algorithms
      1. 18.3.1 FDBSCAN Algorithm
      2. 18.3.2 FOPTICS Algorithm
    4. 18.4 Partitional Clustering Algorithms
      1. 18.4.1 The UK-Means Algorithm
      2. 18.4.2 The CK-Means Algorithm
      3. 18.4.3 Clustering Uncertain Data with Voronoi Diagrams
      4. 18.4.4 Approximation Algorithms for Clustering Uncertain Data
      5. 18.4.5 Speeding Up Distance Computations
    5. 18.5 Clustering Uncertain Data Streams
      1. 18.5.1 The UMicro Algorithm
      2. 18.5.2 The LuMicro Algorithm
      3. 18.5.3 Enhancements to Stream Clustering
    6. 18.6 Clustering Uncertain Data in High Dimensionality
      1. 18.6.1 Subspace Clustering of Uncertain Data
      2. 18.6.2 UPStream: Projected Clustering of Uncertain Data Streams
    7. 18.7 Clustering with the Possible Worlds Model
    8. 18.8 Clustering Uncertain Graphs
    9. 18.9 Conclusions and Summary
    10. Bibliography
      1. Figure 18.1
      2. Figure 18.2
      3. Figure 18.3
      4. Figure 18.4
  24. Chapter 19 Concepts of Visual and Interactive Clustering
    1. 19.1 Introduction
    2. 19.2 Direct Visual and Interactive Clustering
      1. 19.2.1 Scatterplots
      2. 19.2.2 Parallel Coordinates
      3. 19.2.3 Discussion
    3. 19.3 Visual Interactive Steering of Clustering
      1. 19.3.1 Visual Assessment of Convergence of Clustering Algorithm
      2. 19.3.2 Interactive Hierarchical Clustering
      3. 19.3.3 Visual Clustering with SOMs
      4. 19.3.4 Discussion
    4. 19.4 Interactive Comparison and Combination of Clusterings
      1. 19.4.1 Space of Clusterings
      2. 19.4.2 Visualization
      3. 19.4.3 Discussion
    5. 19.5 Visualization of Clusters for Sense-Making
    6. 19.6 Summary
    7. Bibliography
      1. Figure 19.1
      2. Figure 19.2
      3. Figure 19.3
      4. Figure 19.4
      5. Figure 19.5
      6. Figure 19.6
      7. Figure 19.7
      8. Figure 19.8
  25. Chapter 20 Semisupervised Clustering
    1. 20.1 Introduction
    2. 20.2 Clustering with Pointwise and Pairwise Semisupervision
      1. 20.2.1 Semisupervised Clustering Based on Seeding
      2. 20.2.2 Semisupervised Clustering Based on Pairwise Constraints
      3. 20.2.3 Active Learning for Semisupervised Clustering
      4. 20.2.4 Semisupervised Clustering Based on User Feedback
      5. 20.2.5 Semisupervised Clustering Based on Nonnegative Matrix Factorization
    3. 20.3 Semisupervised Graph Cuts
      1. 20.3.1 Semisupervised Unnormalized Cut
      2. 20.3.2 Semisupervised Ratio Cut
      3. 20.3.3 Semisupervised Normalized Cut
    4. 20.4 A Unified View of Label Propagation
      1. 20.4.1 Generalized Label Propagation
      2. 20.4.2 Gaussian Fields
      3. 20.4.3 Tikhonov Regularization (TIKREG)
      4. 20.4.4 Local and Global Consistency
      5. 20.4.5 Related Methods
        1. 20.4.5.1 Cluster Kernels
        2. 20.4.5.2 Gaussian Random Walks EM (GWEM)
        3. 20.4.5.3 Linear Neighborhood Propagation
      6. 20.4.6 Label Propagation and Green’s Function
      7. 20.4.7 Label Propagation and Semisupervised Graph Cuts
    5. 20.5 Semisupervised Embedding
      1. 20.5.1 Nonlinear Manifold Embedding
      2. 20.5.2 Semisupervised Embedding
        1. 20.5.2.1 Unconstrained Semisupervised Embedding
        2. 20.5.2.2 Constrained Semisupervised Embedding
    6. 20.6 Comparative Experimental Analysis
      1. 20.6.1 Experimental Results
      2. 20.6.2 Semisupervised Embedding Methods
    7. 20.7 Conclusions
    8. Bibliography
      1. Figure 20.1
      2. Figure 20.2
      3. Figure 20.3
      4. Figure 20.4
      5. Figure 20.5
      1. Table 20.1
      2. Table 20.2
      3. Table 20.3
  26. Chapter 21 Alternative Clustering Analysis: A Review
    1. 21.1 Introduction
    2. 21.2 Technical Preliminaries
    3. 21.3 Multiple Clustering Analysis Using Alternative Clusterings
      1. 21.3.1 Alternative Clustering Algorithms: A Taxonomy
      2. 21.3.2 Unguided Generation
        1. 21.3.2.1 Naive
        2. 21.3.2.2 Meta Clustering
        3. 21.3.2.3 Eigenvectors of the Laplacian Matrix
        4. 21.3.2.4 Decorrelated k-Means and Convolutional EM
        5. 21.3.2.5 CAMI
      3. 21.3.3 Guided Generation with Constraints
        1. 21.3.3.1 COALA
        2. 21.3.3.2 Constrained Optimization Approach
        3. 21.3.3.3 MAXIMUS
      4. 21.3.4 Orthogonal Transformation Approaches
        1. 21.3.4.1 Orthogonal Views
        2. 21.3.4.2 ADFT
      5. 21.3.5 Information Theoretic
        1. 21.3.5.1 Conditional Information Bottleneck (CIB)
        2. 21.3.5.2 Conditional Ensemble Clustering
        3. 21.3.5.3 NACI
        4. 21.3.5.4 mSC
    4. 21.4 Connections to Multiview Clustering and Subspace Clustering
    5. 21.5 Future Research Issues
    6. 21.6 Summary
    7. Bibliography
  27. Chapter 22 Cluster Ensembles: Theory and Applications
    1. 22.1 Introduction
    2. 22.2 The Cluster Ensemble Problem
    3. 22.3 Measuring Similarity Between Clustering Solutions
    4. 22.4 Cluster Ensemble Algorithms
      1. 22.4.1 Probabilistic Approaches to Cluster Ensembles
        1. 22.4.1.1 A Mixture Model for Cluster Ensembles (MMCE)
        2. 22.4.1.2 Bayesian Cluster Ensembles (BCE)
        3. 22.4.1.3 Nonparametric Bayesian Cluster Ensembles (NPBCE)
      2. 22.4.2 Pairwise Similarity-Based Approaches
        1. 22.4.2.1 Methods Based on Ensemble Co-Association Matrix
        2. 22.4.2.2 Relating Consensus Clustering to Other Optimization Formulations
      3. 22.4.3 Direct Approaches Using Cluster Labels
        1. 22.4.3.1 Graph Partitioning
        2. 22.4.3.2 Cumulative Voting
    5. 22.5 Applications of Consensus Clustering
      1. 22.5.1 Gene Expression Data Analysis
      2. 22.5.2 Image Segmentation
    6. 22.6 Concluding Remarks
    7. Bibliography
      1. Figure 22.1
      2. Figure 22.2
      3. Figure 22.3
      4. Figure 22.4
      5. Figure 22.5
      1. Table 22.1
  28. Chapter 23 Clustering Validation Measures
    1. 23.1 Introduction
    2. 23.2 External Clustering Validation Measures
      1. 23.2.1 An Overview of External Clustering Validation Measures
      2. 23.2.2 Defective Validation Measures
        1. 23.2.2.1 K-Means: The Uniform Effect
        2. 23.2.2.2 A Necessary Selection Criterion
        3. 23.2.2.3 The Cluster Validation Results
        4. 23.2.2.4 The Issues with the Defective Measures
        5. 23.2.2.5 Improving the Defective Measures
    3. 23.2.3 Measure Normalization
      1. 23.2.3.1 Normalizing the Measures
      2. 23.2.3.2 The DCV Criterion
      3. 23.2.3.3 The Effect of Normalization
    4. 23.2.4 Measure Properties
      1. 23.2.4.1 The Consistency Between Measures
      2. 23.2.4.2 Properties of Measures
      3. 23.2.4.3 Discussions
    5. 23.3 Internal Clustering Validation Measures
      1. 23.3.1 An Overview of Internal Clustering Validation Measures
      2. 23.3.2 Understanding of Internal Clustering Validation Measures
        1. 23.3.2.1 The Impact of Monotonicity
        2. 23.3.2.2 The Impact of Noise
        3. 23.3.2.3 The Impact of Density
        4. 23.3.2.4 The Impact of Subclusters
        5. 23.3.2.5 The Impact of Skewed Distributions
        6. 23.3.2.6 The Impact of Arbitrary Shapes
      3. 23.3.3 Properties of Measures
    6. 23.4 Summary
    7. Bibliography
      1. Figure 23.1
      2. Figure 23.2
      3. Figure 23.3
      4. Figure 23.4
      5. Figure 23.5
      6. Figure 23.6
      7. Figure 23.7
      8. Figure 23.8
      9. Figure 23.9
      10. Figure 23.10
      11. Figure 23.11
      12. Figure 23.12
      13. Figure 23.13
      14. Figure 23.14
      15. Figure 23.15
      16. Figure 23.16
      1. Table 23.1
      2. Table 23.2
      3. Table 23.3
      4. Table 23.4
      5. Table 23.5
      6. Table 23.6
      7. Table 23.7
      8. Table 23.8
      9. Table 23.9
      10. Table 23.10
      11. Table 23.11
      12. Table 23.12
      13. Table 23.13
      14. Table 23.14
      15. Table 23.15
      16. Table 23.16
      17. Table 23.17
      18. Table 23.18
      19. Table 23.19
      20. Table 23.20
  29. Chapter 24 Educational and Software Resources for Data Clustering
    1. 24.1 Introduction
    2. 24.2 Educational Resources
      1. 24.2.1 Books on Data Clustering
      2. 24.2.2 Popular Survey Papers on Data Clustering
    3. 24.3 Software for Data Clustering
      1. 24.3.1 Free and Open-Source Software
        1. 24.3.1.1 General Clustering Software
        2. 24.3.1.2 Specialized Clustering Software
      2. 24.3.2 Commercial Packages
      3. 24.3.3 Data Benchmarks for Software and Research
    4. 24.4 Summary
    5. Bibliography