“4137X˙CH08˙Akerkar” 2007/9/14 15:28 page 306 #18
306 CHAPTER 8 Web Structure Mining
8.2.2 Hyperlink-Induced Topic Search (HITS)
The taxonomy of different types of hyperlinks that can be found on the Web was suggested
around 1997. The fruitful research outcome of the study was the understanding of the popu-
larity and the importance of a page and its correlation to the number of incoming links. One
can use this information to sort the query results of a search engine. The in-degree alone is
a poor measure of importance because many pages are frequently pointed to without being
connected to the contents of the referring page.
Kleinberg described a method for web information retrieval using the concepts of hubs
and authorities (Kleinberg 1998, 1998a). This is a detailed evaluation of the importance of
web pages using a variant of the eigenvector calculation used for PageRank. He described hubs
as web pages with good sources of links and authorities as web pages with good sources of
content. For instance, the authorities are your course home pages, whereas the hub is your
college course-listing page. Hubs and authorities exhibit a mutually reinforcing relationship.
A good hub is one that points to many good authorities and a good authority is one that is
pointed to from many good hubs.
Given any topic query specified by a query string σ, we can decide on authoritative pages
by an analysis of the link structure, but we must initially determine the subgraph of the WWW
on which the algorithm will operate. We would like to have the subgraph somewhat small, rich
in relevant pages, and containing several of the strongest authorities.
We can construct a root set R
σ
of the World Wide Web by accumulating the t highest-
ranked pages for the query σ from a text-based search engine such as Google or Yahoo for a
particular parameter t. This set of pages is somewhat small and rich in relevant pages, but does
not consistently contain several of the strongest authorities. Even though a strong authority
for the query topic might not be in the root set R
σ
, it is likely to be pointed to by at least
one page in R
σ
. We can expand this root set R
σ
to produce a set of pages S
σ
that fulfills the
criteria by taking any Web pages pointed to by a page in R
σ
.
Now, we correlate a non-negative authority weight AUTH
(p)
and a non-negative hub
weight HUB
(p)
with each page p. The weights are normalized so that
pS
σ
(AUT H
(p)
)
2
=1
pS
σ
(HUB
(p)
)
2
=1
The normalization can be accomplished by dividing the AUTH and HUB values by the square
root of the sum of their squares,
normalizedAUT H
(p)
=
AUT H
(p)
uS
σ
AUT H
(u)
2
Then pages with larger AUTH and HUB values are viewed as “better” authorities and hubs
respectively. This can be done using the iterative algorithm shown in
Figure 8.14.
“4137X˙CH08˙Akerkar” 2007/9/14 15:28 page 307 #19
8.2 Modeling Web Topology 307
HITS Algorithm: Assume that there are n linked pages.
Let S
σ
=(V,E). (V = set of pages, E = set of hyperlinks between pages)
Initialize
"
HUB =
"
AUTH = (1, 1,... ,1) R
n
Repeat until
"
HUB and
"
AUTH converge (i.e., stabilize or do not change);
Normalize
"
HUB and
"
AUTH .
For all pages p V,
HUB
(p)
=
q:(p,q)E
AUTH
(q )
AUTH
(p)
=
q:(q,p)E
HUB
(q )
Return
"
HUB and
"
AUTH
Figure 8.14 HITS algorithm
References
Example 8.7 Apply the previous algorithm to the network from Figure 8.15.
After initializing the
"
HUB and
"
AUT H uniformly to 1 and normalizing the two vectors,
we get the following values:
Document = A HUB = 0.58 AUTH = 0.58
Document = B HUB = 0.58 AUTH = 0.58
Document = C HUB = 0.58 AUTH = 0.58
The new HUB value of document A will be the sum of AUTH values of B and C: 1.16.
The new HUB values of both B and C will be equal to 0, because they are not pointing
to any page.
The new AUTH values of A will be 0, because no page is pointing to it.
The new AUTH values of both B and C will be equal to the HUB value of A: 0.58.
BA
C
Figure 8.15 Web graph used in Example 8.7

Get Building an Intelligent Web: Theory and Practice now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.