Chapter 3

Semi-Supervised Classification Using Prior Word Clustering

3.1. Introduction

In the first part of this book, two semi-supervised approaches have been developed, which exploit the availability of unlabeled data by means of unsupervised clustering.

As stated in Chapter 1, cluster-and-label approaches in the machine learning literature often merge the clustering and labeling steps into a single global optimization problem in which both tasks are solved simultaneously. The approaches developed in this book are instead intended to keep the labeled seeds from influencing the cluster solution, since labeling errors in the labeled sets can otherwise induce a wrong clustering.
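To make the separation of the two steps concrete, the following is a minimal sketch, not the algorithm developed in this book. It assumes a plain k-means step with fixed initial centroids and a majority vote over the seeds in each cluster; the function names `kmeans` and `label_clusters` and the toy data are hypothetical and chosen only for illustration. The clustering never sees the labels, so a mislabeled seed cannot change the cluster solution.

```python
from collections import Counter


def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats. Labels are never consulted here."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Recompute each centroid as the mean of its current members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return assignment


def label_clusters(assignment, seeds):
    """Give every point the majority seed label of its cluster.

    seeds maps a point index to its (possibly erroneous) label.
    Points in clusters without any seed receive None.
    """
    votes = {}
    for index, label in seeds.items():
        votes.setdefault(assignment[index], Counter())[label] += 1
    cluster_label = {k: count.most_common(1)[0][0] for k, count in votes.items()}
    return [cluster_label.get(a) for a in assignment]


# Toy data: two well-separated groups of three points each (illustrative only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
assignment = kmeans(points, centroids=[points[0], points[3]])

# Seed 2 is deliberately mislabeled as "B"; the clustering is unaffected by it.
seeds = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B"}
predicted = label_clusters(assignment, seeds)
```

In this toy setting the mislabeled seed is outvoted inside its cluster, so it affects neither the partition nor the final labels. In a joint optimization, by contrast, the same seed could pull the cluster boundary toward it.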

In particular, the algorithm described in this chapter is based on the synonymy assumption: when only minimal class labels are available, the underlying classes can be approximately recovered by extracting semantic similarities from the data. Here the synonymy assumption is applied to the feature space, where a clustering algorithm is used to extract groups of synonymous words. In previous research [LI 98], word clustering has typically been investigated as a feature clustering strategy for supervised text classification, and a degradation in classification accuracy was reported when it was applied in that setting. However, the experiments described in this chapter show that a classification algorithm using a minimal number of labeled seeds can benefit from word clustering on unlabeled data (semi-supervised ...
