We have seen two extreme learning paradigms so far. The setting in Chapter 4 was unsupervised: only a collection of documents was provided, without any labels, and the system was supposed to propose a grouping of the documents based on similarity. In contrast, Chapter 5 considered the fully supervised setting, where each object was tagged with a class. Real-life applications are somewhere in between. It is generally easy to collect unlabeled data: every time Google completes a crawl, a collection of over a billion documents is created. On the other hand, labeling is a laborious job, which explains why the size and reach of Yahoo! and the Open Directory lag behind the size of the Web.

Consider a document ...
