
Challenges in Crawling the Deep Web
increase of P, the cost increases at a faster rate. If the goal is to harvest as much data as
possible from many data sources, rather than to exhaustively siphon every data record from a
single data source, Equation 4.2 offers a guideline for when to move on to another data source
under a fixed crawling budget.
Since d can be calculated easily from the crawling history, Equation 4.6 is particularly
useful for estimating how much data has been downloaded and when the crawling process
will stop. Another surprising observation is that large queries induce the same
overlapping rate