Feature Selection and Evaluation
Selecting the right features for classification is a major task in all areas of pattern matching and machine learning. This is a very difficult problem. In practice, adding a new feature to an existing feature vector may increase or decrease performance depending on the features already present. The search for the perfect vector is an NP-complete problem. In this chapter, we will discuss some common techniques that can be adopted with relative ease.
13.1 Overfitting and Underfitting
In order to get optimal classification accuracy, the model must have just the right level of complexity. Model complexity is determined by many factors, one of which is the dimensionality of the feature space. The more features we use, the more degrees of freedom we have to fit the model, and the more complex it becomes.
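The link between the number of features and the degrees of freedom can be made concrete with a small experiment. The sketch below is an illustration only, not a method from this book: it assumes synthetic binary data in which only the first feature is informative (its label is flipped with probability 0.2), and a hypothetical lookup-table classifier that memorises the majority label for every distinct combination of the first d features. All names (make_data, train_table, error_curves) are invented for the example. It reports the error rate on the data used for training and on fresh data as d grows.

```python
import random
from collections import Counter, defaultdict


def error_rate(predicted, actual):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


def make_data(n, n_features, rng, noise=0.2):
    """Synthetic data (an assumption for this sketch): the label follows
    feature 0 and is flipped with probability `noise`; every other feature
    is a random bit that carries no information about the label."""
    X, y = [], []
    for _ in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(n_features))
        label = x[0] if rng.random() >= noise else 1 - x[0]
        X.append(x)
        y.append(label)
    return X, y


def train_table(X, y, d):
    """Memorise the majority label for each distinct value of the first d
    features. With d binary features the table has up to 2**d cells, so
    each added feature doubles the degrees of freedom."""
    cells = defaultdict(Counter)
    for x, label in zip(X, y):
        cells[x[:d]][label] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in cells.items()}
    default = Counter(y).most_common(1)[0][0]  # used for cells never seen in training
    return table, default


def predict(model, X, d):
    table, default = model
    return [table.get(x[:d], default) for x in X]


def error_curves(max_features=10, n_train=200, n_test=2000, seed=0):
    """Return (d, training error, testing error) for d = 1..max_features."""
    rng = random.Random(seed)
    X_train, y_train = make_data(n_train, max_features, rng)
    X_test, y_test = make_data(n_test, max_features, rng)
    curves = []
    for d in range(1, max_features + 1):
        model = train_table(X_train, y_train, d)
        train_err = error_rate(predict(model, X_train, d), y_train)
        test_err = error_rate(predict(model, X_test, d), y_test)
        curves.append((d, train_err, test_err))
    return curves


if __name__ == "__main__":
    for d, train_err, test_err in error_curves():
        print(f"{d:2d} features: training error {train_err:.3f}, "
              f"testing error {test_err:.3f}")
```

For this particular classifier, the error on the training data can never increase when a feature is added, because each table cell is only split into finer cells. The error on fresh data has no such guarantee, and with uninformative features it will typically stop improving and start to deteriorate. The exact figures depend on the seed and the sample sizes chosen here.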
To understand what happens when a model is too complex or too simple, it is useful to study both the training error rate and the testing error rate. We are familiar with the testing error rate from Chapter 10, where we defined the accuracy as one minus the testing error rate, while the training error rate is obtained by testing the classifier on the training set. Obviously, ...