Chapter 5. Bias-Variance Trade-Off
A machine [classifier] with too much capacity [ability to fit training data exactly] is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, then it’s a tree. Neither can generalize well.
— Christopher J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998
Recall from Chapter 2 that an approximation method is a function that maps a training dataset D to an approximation f̂_D, and the risk of an approximation method is the expected loss with respect to the distribution of new data and of training datasets,

    R = E_{D, (X, Y)} [ L(Y, f̂_D(X)) ],

where the expectation is taken jointly over the training dataset D and a new observation (X, Y).
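To make the definition concrete, the risk can be estimated by Monte Carlo: draw many training datasets, fit the approximation method to each, and average the squared-error loss on fresh data. The sketch below is illustrative rather than taken from the book; the true function `f`, the noise level, and the use of polynomial least squares as the approximation method are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: true regression function f, Gaussian noise with
# standard deviation 0.2, and an approximation method that fits a
# polynomial of a given degree by least squares.
def f(x):
    return np.sin(2 * np.pi * x)

def draw_training_set(n=20):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, 0.2, n)
    return x, y

def approximate(x, y, degree):
    # The approximation method: training dataset -> fitted function.
    coefs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coefs, x_new)

def estimate_risk(degree, n_datasets=500, n_test=1000):
    # Monte Carlo estimate of risk: average squared-error loss over
    # many training datasets AND many new (x, y) pairs.
    losses = []
    for _ in range(n_datasets):
        f_hat = approximate(*draw_training_set(), degree)
        x_new = rng.uniform(0, 1, n_test)
        y_new = f(x_new) + rng.normal(0, 0.2, n_test)
        losses.append(np.mean((y_new - f_hat(x_new)) ** 2))
    return np.mean(losses)

print(estimate_risk(degree=1))  # rigid model: high risk from bias
print(estimate_risk(degree=3))  # more flexible model
```

Note that no estimate can fall below the noise variance (here 0.2² = 0.04), since that part of the risk is irreducible.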
The risk of an approximation method decomposes in an informative way when squared-error loss is used. Specifically, under squared-error loss, risk decomposes into a sum of three nonnegative terms, one of which we can do nothing about and two of which we can affect. As we shall see in Chapter 6, viewing risk minimization as minimization of the sum of two nonnegative terms, and having useful ...
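The three-term decomposition can be checked numerically at a fixed query point: the risk there splits into the irreducible noise variance, the squared bias of the average prediction, and the variance of predictions across training datasets. The simulation below is a sketch under the same assumed setup as above (sinusoidal truth, Gaussian noise, polynomial least squares); none of these choices come from the book itself.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.2  # assumed noise standard deviation

def f(x):
    return np.sin(2 * np.pi * x)

def fit(degree, n=20):
    # Draw a training dataset and fit a polynomial by least squares.
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, SIGMA, n)
    return np.polyfit(x, y, degree)

# At a fixed query point x0, collect predictions from many
# independently drawn training datasets.
x0 = 0.3
preds = np.array([np.polyval(fit(degree=3), x0) for _ in range(5000)])

irreducible = SIGMA ** 2               # noise term: nothing we can do
bias_sq = (preds.mean() - f(x0)) ** 2  # squared bias: affected by capacity
variance = preds.var()                 # variance across training datasets

# Direct Monte Carlo estimate of the risk at x0, for comparison.
y0 = f(x0) + rng.normal(0, SIGMA, preds.size)
risk = np.mean((y0 - preds) ** 2)

print(irreducible + bias_sq + variance)  # should closely match...
print(risk)                              # ...the direct estimate
```

Raising the polynomial degree shrinks the bias term but inflates the variance term, which is the trade-off the chapter title names.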