This appendix is reproduced (with slight modifications and corrections) from a post, of the same name, from the author’s blog.
In the paper “The unreasonable effectiveness of data,”1 Halevy, Norvig, and Pererira, all from Google, argue that interesting things happen when corpora get to web scale:
simple models and a lot of data trump more elaborate models based on less data.
In that paper and the more detailed tech talk given by Norvig, they demonstrate that when corpora get to hundreds of millions or trillions of training sample or words, very simple models with basic independence assumptions can outperform more complex models such as those based on carefully-crafted ontologies with smaller data. However, they provided relatively little explanation as to why more data is better. In this appendix, I want to attempt to think through that.
I propose that there are several classes of problems and reasons for why more data is better.
The first are nearest neighbor type problems. Halevy et al. give an example:
James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos.2