Alternatives to Offering Deep-Web Access

There are two common approaches to offering access to deep-web content. The first approach, popular with vertical search engines, is to create mediators for specific domains (e.g., cars, books, or real estate). In this approach, we could create a single master form (the mediator) for a domain and then create semantic mappings between the individual forms and the mediator. For each query over the mediator, the relevant underlying forms are selected based on precomputed form summaries, and the semantic mappings are used to construct a query over each selected form. Content is then retrieved from each of the selected forms and combined before being presented to the user. At a high level, this approach is very similar in spirit to modern comparison shopping portals, which retrieve offers from multiple underlying sites using web services.
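The mediator pipeline described above (select forms by summary, translate the query through semantic mappings, retrieve, combine) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of any particular vertical search engine: the names `SourceForm`, `select_forms`, and `mediated_search` are hypothetical, form summaries are modeled as plain keyword sets, and each form's backing site is simulated by an in-memory list of records rather than a real HTTP form submission.

```python
from dataclasses import dataclass, field


@dataclass
class SourceForm:
    """One underlying web form (hypothetical model for illustration)."""
    name: str
    # Semantic mapping: mediator field name -> this form's input name.
    mapping: dict
    # Precomputed form summary, modeled here as a set of lowercase keywords.
    summary: set
    # Stand-in for the site's database; a real system would submit the form.
    records: list = field(default_factory=list)

    def translate(self, mediator_query):
        """Rewrite a mediator query using this form's own input names."""
        return {self.mapping[k]: v
                for k, v in mediator_query.items() if k in self.mapping}

    def submit(self, form_query):
        """Return the records that match every input of the translated query."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in form_query.items())]


def select_forms(forms, mediator_query, limit=2):
    """Keep at most `limit` forms whose summaries overlap the query values."""
    terms = {str(v).lower() for v in mediator_query.values()}
    scored = [(len(terms & f.summary), f) for f in forms]
    relevant = [s for s in scored if s[0] > 0]
    # Stable sort on the score only, so ties keep their original order.
    ranked = sorted(relevant, key=lambda s: -s[0])
    return [f for _, f in ranked[:limit]]


def mediated_search(forms, mediator_query, limit=2):
    """Query the selected forms and combine results under mediator field names."""
    combined = []
    for form in select_forms(forms, mediator_query, limit):
        inverse = {v: k for k, v in form.mapping.items()}
        for record in form.submit(form.translate(mediator_query)):
            row = {inverse.get(k, k): v for k, v in record.items()}
            combined.append({"source": form.name, **row})
    return combined


# Made-up example data: two car forms and one book form.
EXAMPLE_FORMS = [
    SourceForm("autos-a", {"make": "brand", "model": "model_name"},
               {"honda", "toyota", "civic"},
               [{"brand": "honda", "model_name": "civic", "price": 9000}]),
    SourceForm("autos-b", {"make": "mk", "model": "mdl"},
               {"honda", "ford"},
               [{"mk": "honda", "mdl": "civic", "price": 8500},
                {"mk": "ford", "mdl": "focus", "price": 7000}]),
    SourceForm("books-c", {"title": "t"}, {"novel", "poetry"},
               [{"t": "emma"}]),
]

if __name__ == "__main__":
    for row in mediated_search(EXAMPLE_FORMS, {"make": "honda", "model": "civic"}):
        print(row)
```

Note that every new site needs its own hand-written `mapping` and `summary`, which is a small-scale picture of the per-form effort a mediator requires.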

Although adequate for vertical search, which focuses on homogeneous collections of forms within a single domain, this approach is unsuitable for a general-purpose search engine. First, the human cost of building and maintaining the many different mediators and mappings is high. Second, identifying the forms that are most relevant to a search engine keyword query is extremely challenging. Only a small number of forms can be selected for each query; otherwise, the underlying forms can receive more user traffic than they can handle. To achieve this, at the extreme, the form summaries might need to be almost ...
