IN THIS CHAPTER
Explaining how correct sampling is critical in machine learning
Highlighting errors dictated by bias and variance
Proposing different approaches to validation and testing
Warning against biased samples, overfitting, underfitting, and snooping
Having examples (in the form of datasets) and a machine learning algorithm at hand doesn't guarantee that solving a learning problem is possible or that the results will provide the desired solution. For example, if you want your computer to distinguish a photo of a dog from a photo of a cat, you can provide it with good examples of dogs and cats. You then train a dog versus cat classifier, based on some machine learning algorithm, that outputs the probability that a given photo is a dog or a cat. Of course, the output is a probability — not an absolute assurance that the photo shows a dog or a cat.
Based on the probability that the classifier reports, you can decide the class (dog or cat) of a photo. When the probability is higher for a dog, you minimize the risk of a wrong assessment by choosing dog. The greater the gap between the probability of a dog and that of a cat, the more confidence you can have in your choice. A close call likely occurs because of some ambiguity in the photo (the photo is not clear, or the dog is actually a bit cattish). For that ...
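The decision rule described above can be sketched in a few lines of Python. The probabilities and the margin threshold below are purely illustrative (not taken from any real classifier); the idea is simply to pick the more probable class and to treat a small probability gap as a sign of ambiguity.

```python
def decide(p_dog, p_cat, margin_threshold=0.2):
    """Pick the more probable class and flag low-confidence calls.

    p_dog and p_cat are the classifier's estimated probabilities;
    margin_threshold is an illustrative cutoff for 'confident enough'.
    """
    label = "dog" if p_dog >= p_cat else "cat"
    margin = abs(p_dog - p_cat)          # gap between the two probabilities
    confident = margin >= margin_threshold
    return label, margin, confident

# A clear-cut photo: large gap, confident decision.
print(decide(0.85, 0.15))

# An ambiguous ("cattish") photo: tiny gap, low confidence.
print(decide(0.55, 0.45))
```

With a large gap (0.85 versus 0.15) the rule confidently answers "dog"; with a near tie (0.55 versus 0.45) it still answers "dog" but flags the decision as uncertain, which mirrors the reasoning in the text.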