
Big Data: Storage, Sharing, and Security
performance of a deep web crawler. For instance, suppose that two data sources A and B
contain 1,000 and 100,000 documents, respectively. To crawl A with 100% coverage, a
crawler incurs a cost of 5,000 matched documents. To crawl B with 100% coverage, the same
crawler retrieves 500,000 matched documents. The performance of the crawler on A and B is
then identical, since the overlapping rate is 5 in both cases (5,000/1,000 for A and
500,000/100,000 for B).
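The arithmetic in this example can be checked with a minimal sketch. It assumes that the overlapping rate is the total number of matched documents divided by the number of distinct documents retrieved, which is the reading consistent with the figures above. The function name `overlapping_rate` and its parameters are hypothetical, chosen only for illustration.

```python
def overlapping_rate(matched_documents: int, distinct_documents: int) -> float:
    """Return matched documents per distinct document retrieved.

    Assumed definition: total matched documents (the crawling cost)
    divided by the number of distinct documents actually retrieved.
    """
    if distinct_documents <= 0:
        raise ValueError("distinct_documents must be positive")
    return matched_documents / distinct_documents


# Source A: 1,000 documents, 100% coverage, 5,000 matched documents.
rate_a = overlapping_rate(5_000, 1_000)

# Source B: 100,000 documents, 100% coverage, 500,000 matched documents.
rate_b = overlapping_rate(500_000, 100_000)

print(rate_a, rate_b)
```

Under this assumed definition both calls give the same ratio, even though the absolute cost on B is one hundred times the cost on A. This is why the rate, not the raw cost, makes the two crawls comparable.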
More formally, given a document-term bipartite graph G = (D, Q, E), a set of queries
Q_s ⊆ Q selected by a crawler forms a subgraph denoted by G_s = (D_s, Q_s
,E