Semi-Supervised and Unsupervised Machine Learning: Novel Strategies

by Amparo Albalate, Wolfgang Minker

January 2011

Intermediate to advanced

320 pages

4h 50m

English

Wiley

Cover
Title Page
Copyright
Part 1: State of the Art
Chapter 1: Introduction
1.1. Organization of the book
1.2. Utterance corpus
1.3. Datasets from the UCI repository
1.3.1. Wine dataset (wine)
1.3.2. Wisconsin breast cancer dataset (breast)
1.3.3. Handwritten digits dataset (Pendig)
1.3.4. Pima Indians diabetes (diabetes)
1.3.5. Iris dataset (Iris)
1.4. Microarray dataset
1.5. Simulated datasets
1.5.1. Mixtures of Gaussians
1.5.2. Spatial datasets with non-homogeneous inter-cluster distance
Chapter 2: State of the Art in Clustering and Semi-Supervised Techniques
2.1. Introduction
2.2. Unsupervised machine learning (clustering)
2.3. A brief history of cluster analysis
2.4. Cluster algorithms
2.4.1. Hierarchical algorithms
2.4.1.1. Agglomerative clustering
2.4.1.1.1. Comparison of agglomerative criteria
2.4.1.2. Divisive algorithms
2.4.2. Model-based clustering
2.4.2.1. The expectation maximization (EM) algorithm
2.4.2.1.1. Example: mixtures of Gaussians
2.4.3. Partitional competitive models
2.4.3.1. K-means
2.4.3.1.1. Advantages and drawbacks
2.4.3.2. Neural gas
2.4.3.2.1. Advantages and drawbacks
2.4.3.3. Partitioning around Medoids (PAM)
2.4.3.3.1. Build step
2.4.3.3.2. Swap phase
2.4.3.3.3. Advantages and drawbacks
2.4.3.4. Self-organizing maps
2.4.3.4.1. Advantages and drawbacks
2.4.4. Density-based clustering
2.4.4.1. Direct density reachability
2.4.4.2. Density reachability
2.4.4.3. Density connection
2.4.4.4. Border points
2.4.4.5. Noise points
2.4.4.6. DBSCAN algorithm
2.4.4.6.1. Advantages and drawbacks
2.4.5. Graph-based clustering
2.4.5.1. Pole-based overlapping clustering
2.4.5.1.1. Definition of a dissimilarity graph
2.4.5.1.2. Pole construction
2.4.5.1.3. Pole restriction
2.4.6. Affectation stage
2.4.6.1. Advantages and drawbacks
2.5. Applications of cluster analysis
2.5.1. Image segmentation
2.5.2. Molecular biology
2.5.2.1. Biological considerations
2.5.3. Information retrieval and document clustering
2.5.3.1. Document pre-processing
2.5.3.1.1. Word selection
2.5.3.1.2. Stop word filtering
2.5.3.1.3. Word lemmatizing/stemming
2.5.3.2. Boolean model representation
2.5.3.3. Vector space model
2.5.3.4. Term weighting
2.5.3.4.1. Term frequency component
2.5.3.4.2. Collection frequency component
2.5.3.4.3. Length normalization component
2.5.3.5. Probabilistic models
2.5.3.5.1. Binary independence retrieval model
2.5.3.5.2. The 2-Poisson model
2.5.3.5.3. Okapi weighting
2.5.4. Clustering documents in information retrieval
2.5.4.1. Clustering of presented results
2.5.4.2. Post-retrieval document browsing (Scatter-Gather)
2.6. Evaluation methods
2.7. Internal cluster evaluation
2.7.1. Entropy
2.7.2. Purity
2.7.3. Normalized mutual information
2.8. External cluster validation
2.8.1. Hartigan
2.8.2. Davies Bouldin index
2.8.3. Krzanowski and Lai index
2.8.4. Silhouette
2.8.5. Gap statistic
2.9. Semi-supervised learning
2.9.1. Self training
2.9.2. Co-training
2.9.3. Generative models
2.10. Summary
Part 2: Approaches to Semi-Supervised Classification
Chapter 3: Semi-Supervised Classification Using Prior Word Clustering
3.1. Introduction
3.2. Dataset
3.3. Utterance classification scheme
3.3.1. Pre-processing
3.3.1.1. Utterance vector representation
3.3.2. Utterance classification
3.4. Semi-supervised approach based on term clustering
3.4.1. Term clustering
3.4.2. Semantic term dissimilarity
3.4.2.1. Term vector of lexical co-occurrences
3.4.2.2. Metric of dissimilarity
3.4.3. Term vector truncation
3.4.4. Term clustering
3.4.5. Feature extraction and utterance feature vector
3.4.6. Evaluation
3.5. Disambiguation
3.5.1. Evaluation
3.6. Summary
Chapter 4: Semi-Supervised Classification Using Pattern Clustering
4.1. Introduction
4.2. New semi-supervised algorithm using the cluster and label strategy
4.2.1. Block diagram
4.2.1.1. Dataset
4.2.1.2. Clustering
4.2.1.3. Optimum cluster labeling
4.2.1.4. Classification
4.3. Optimum cluster labeling
4.3.1. Problem definition
4.3.2. The Hungarian algorithm
4.3.2.1. Weighted complete bipartite graph
4.3.2.2. Matching, perfect matching and maximum weight matching
4.3.2.3. Objective of Hungarian method
4.3.2.4. Complexity considerations
4.3.3. Genetic algorithms
4.3.3.1. Reproduction operators
4.3.3.1.1. Crossover
4.3.3.1.2. Mutation
4.3.3.2. Forming the next generation
4.3.3.2.1. Generational replacement
4.3.3.2.2. Elitism with generational replacement
4.3.3.2.3. Steady state representation
4.3.3.3. GAs applied to optimum cluster labeling
4.3.3.4. Comparison of methods
4.4. Supervised classification block
4.4.1. Support vector machines
4.4.1.1. The kernel trick for nonlinearly separable classes
4.4.1.2. Multi-class classification
4.4.2. Example
4.5. Datasets
4.5.1. Mixtures of Gaussians
4.5.2. Datasets from the UCI repository
4.5.2.1. Iris dataset (Iris)
4.5.2.2. Wine dataset (wine)
4.5.2.3. Wisconsin breast cancer dataset (breast)
4.5.2.4. Handwritten digits dataset (Pendig)
4.5.2.5. Pima Indians diabetes (diabetes)
4.5.3. Utterance dataset
4.6. An analysis of the bounds for the cluster and label approaches
4.7. Extension through cluster pruning
4.7.1. Determination of silhouette thresholds
4.7.2. Evaluation of the cluster pruning approach
4.8. Simulations and results
4.9. Summary
Part 3: Contributions to Unsupervised Classification — Algorithms to Detect the Optimal Number of Clusters

Chapter 5: Detection of the Number of Clusters through Non-Parametric Clustering Algorithms
5.1. Introduction
5.2. New hierarchical pole-based clustering algorithm
5.2.1. Pole-based clustering basis module
5.2.2. Hierarchical pole-based clustering
5.3. Evaluation
5.3.1. Cluster evaluation metrics
5.4. Datasets
5.4.1. Results
5.4.2. Complexity considerations for large databases
5.5. Summary
Chapter 6: Detecting the Number of Clusters through Cluster Validation
6.1. Introduction
6.2. Cluster validation methods
6.2.1. Dunn index
6.2.2. Hartigan
6.2.3. Davies Bouldin index
6.2.4. Krzanowski and Lai index
6.2.5. Silhouette
6.2.6. Hubert’s γ
6.2.7. Gap statistic
6.3. Combination approach based on quantiles
6.4. Datasets
6.4.1. Mixtures of Gaussians
6.4.2. Cancer DNA-microarray dataset
6.4.3. Iris dataset
6.5. Results
6.5.1. Validation results of the five Gaussian dataset
6.5.2. Validation results of the mixture of seven Gaussians
6.5.3. Validation results of the NCI60 dataset
6.5.4. Validation results of the Iris dataset
6.5.5. Discussion
6.6. Application of speech utterances
6.7. Summary
Bibliography
Index

Overview

This book provides a detailed and up-to-date overview of classification and data mining methods. The first part focuses on supervised classification algorithms and their applications, including recent research on the combination of classifiers. The second part deals with unsupervised data mining and knowledge discovery, with special attention to text mining. Discovering the underlying structure of a data set has been a key research topic associated with unsupervised techniques, with multiple applications and challenges ranging from web-content mining to the inference of cancer subtypes in genomic microarray data. Among those, the book focuses on a new application for dialog systems, which can thereby be made adaptable and portable to different domains. Clustering evaluation metrics and new approaches, such as ensembles of clustering algorithms, are also described.

Publisher Resources

ISBN: 9781118586136
