Classification trees for replicated data
In this next example from plant taxonomy, the response variable is a four-level, categorical variable called Taxon (it is a label expressed as Roman numerals I to IV). The aim is to use the measurements from the seven morphological explanatory variables to construct the best key to separate these four taxa (the ‘best’ key is the one with the lowest error rate – the key that misclassifies the smallest possible number of cases).
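The error rate used to judge a key can be computed from a cross-tabulation of observed against predicted labels. The sketch below uses base R only; the six labels are made up purely for illustration (they are not taken from taxonomy.txt), and the names observed, predicted, confusion and error_rate are hypothetical.

```r
# Hypothetical labels for six plants; NOT taken from taxonomy.txt
taxa <- c("I", "II", "III", "IV")
observed  <- factor(c("I", "I",  "II", "III", "IV", "IV"), levels = taxa)
predicted <- factor(c("I", "II", "II", "III", "IV", "IV"), levels = taxa)

# Rows are the true taxa, columns are the taxa assigned by the key
confusion <- table(observed, predicted)

# Correct classifications lie on the diagonal of the table
error_rate <- 1 - sum(diag(confusion)) / sum(confusion)
error_rate   # one plant out of six is misclassified
```

An error-free key would put every count on the diagonal of the table, giving an error rate of zero.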
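# stringsAsFactors=TRUE is needed in R 4.0 or later so that Taxon is read as a factor
taxonomy<-read.table("c:\\temp\\taxonomy.txt",header=TRUE,stringsAsFactors=TRUE)
attach(taxonomy)
names(taxonomy)
[1] "Taxon"     "Petals"    "Internode" "Sepal"     "Bract"     "Petiole"
[7] "Leaf"      "Fruit"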
Using the tree model for classification could not be simpler:
library(tree)   # the tree() function comes from the tree package
model1<-tree(Taxon~.,taxonomy)
We begin by looking at the plot of the tree:
plot(model1)
text(model1)
With only a small degree of rounding on the suggested break points, the tree model suggests a simple (and for these 120 plants, completely error-free) key for distinguishing the four taxa:
1. Sepal length > 4.0 | Taxon IV
1. Sepal length <= 4.0 | 2.
2. Leaf width > 2.0 | Taxon III
2. Leaf width <= 2.0 | 3.
3. Petiole length < 10 | Taxon II
3. Petiole length >= 10 | Taxon I
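The key above can be written as a small vectorized function. This is a minimal sketch using the rounded break points from the key, not the exact splits fitted by tree. The function name key_taxon is hypothetical, and it assumes that the columns Sepal, Leaf and Petiole hold sepal length, leaf width and petiole length respectively. The four test plants are invented to exercise each branch.

```r
# Hypothetical helper: applies the rounded key to vectors of measurements.
# Assumes Sepal = sepal length, Leaf = leaf width, Petiole = petiole length.
key_taxon <- function(Sepal, Leaf, Petiole) {
  ifelse(Sepal > 4.0, "IV",               # couplet 1
    ifelse(Leaf > 2.0, "III",             # couplet 2
      ifelse(Petiole < 10, "II", "I")))   # couplet 3
}

# Made-up measurements, one plant per branch of the key
result <- key_taxon(Sepal   = c(4.5, 3.0, 3.0, 3.0),
                    Leaf    = c(1.5, 2.5, 1.5, 1.5),
                    Petiole = c(8,   8,   8,   12))
result   # "IV" "III" "II" "I"
```

With the data frame loaded, something like with(taxonomy, table(Taxon, key_taxon(Sepal, Leaf, Petiole))) would show how well the rounded key matches the recorded taxa.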
The summary option for classification trees produces the following:
summary(model1)
Classification tree:
tree(formula = Taxon ~ ., data = taxonomy)
Variables actually used in tree construction:
[1] "Sepal" "Leaf" ...