We have seen two extreme learning paradigms so far. The setting in Chapter 4 was unsupervised: only a collection of documents was provided, without any labels, and the system was supposed to propose a grouping of the documents based on similarity. In contrast, Chapter 5 considered the completely supervised setting, where each object was tagged with a class. Real-life applications lie somewhere in between. It is generally easy to collect unlabeled data: every time Google completes a crawl, a collection of over a billion documents is created. Labeling, on the other hand, is laborious, which explains why the size and reach of Yahoo! and the Open Directory lag behind the size of the Web.

Consider a document ...
