✐

✐

“4137X˙CH08˙Akerkar” — 2007/9/14 — 15:28 — page 306 — #18

✐

✐

✐

✐

✐

✐

306 CHAPTER 8 Web Structure Mining

8.2.2 Hyperlink-Induced Topic Search (HITS)

The taxonomy of diﬀerent types of hyperlinks that can be found on the Web was suggested

around 1997. The fruitful research outcome of the study was the understanding of the popu-

larity and the importance of a page and its correlation to the number of incoming links. One

can use this information to sort the query results of a search engine. The in-degree alone is

a poor measure of importance because many pages are frequently pointed to without being

connected to the contents of the referring page.

Kleinberg described a method for web information retrieval using the concepts of hubs

and authorities (Kleinberg 1998, 1998a). This is a detailed evaluation of the importance of

web pages using a variant of the eigenvector calculation used for PageRank. He described hubs

as web pages with good sources of links and authorities as web pages with good sources of

content. For instance, the authorities are your course home pages, whereas the hub is your

college course-listing page. Hubs and authorities exhibit a mutually reinforcing relationship.

A good hub is one that points to many good authorities and a good authority is one that is

pointed to from many good hubs.

Given any topic query speciﬁed by a query string σ, we can decide on authoritative pages

by an analysis of the link structure, but we must initially determine the subgraph of the WWW

on which the algorithm will operate. We would like to have the subgraph somewhat small, rich

in relevant pages, and containing several of the strongest authorities.

We can construct a root set R

σ

of the World Wide Web by accumulating the t highest-

ranked pages for the query σ from a text-based search engine such as Google or Yahoo for a

particular parameter t. This set of pages is somewhat small and rich in relevant pages, but does

not consistently contain several of the strongest authorities. Even though a strong authority

for the query topic might not be in the root set R

σ

, it is likely to be pointed to by at least

one page in R

σ

. We can expand this root set R

σ

to produce a set of pages S

σ

that fulﬁlls the

criteria by taking any Web pages pointed to by a page in R

σ

.

Now, we correlate a non-negative authority weight AUTH

(p)

and a non-negative hub

weight HUB

(p)

with each page p. The weights are normalized so that

p∈S

σ

(AUT H

(p)

)

2

=1

p∈S

σ

(HUB

(p)

)

2

=1

The normalization can be accomplished by dividing the AUTH and HUB values by the square

root of the sum of their squares,

normalizedAUT H

(p)

=

AUT H

(p)

u∈S

σ

AUT H

(u)

2

Then pages with larger AUTH and HUB values are viewed as “better” authorities and hubs

respectively. This can be done using the iterative algorithm shown in

Figure 8.14.

✐

✐

“4137X˙CH08˙Akerkar” — 2007/9/14 — 15:28 — page 307 — #19

✐

✐

✐

✐

✐

✐

8.2 Modeling Web Topology 307

HITS Algorithm: Assume that there are n linked pages.

Let S

σ

=(V,E). (V = set of pages, E = set of hyperlinks between pages)

Initialize

"

HUB =

"

AUTH = (1, 1,... ,1) ∈ R

n

Repeat until

"

HUB and

"

AUTH converge (i.e., stabilize or do not change);

Normalize

"

HUB and

"

AUTH .

For all pages p ∈ V,

HUB

(p)

=

q:(p,q)∈E

AUTH

(q )

AUTH

(p)

=

q:(q,p)∈E

HUB

(q )

Return

"

HUB and

"

AUTH

Figure 8.14 HITS algorithm

References

Example 8.7 Apply the previous algorithm to the network from Figure 8.15.

After initializing the

"

HUB and

"

AUT H uniformly to 1 and normalizing the two vectors,

we get the following values:

Document = A HUB = 0.58 AUTH = 0.58

Document = B HUB = 0.58 AUTH = 0.58

Document = C HUB = 0.58 AUTH = 0.58

The new HUB value of document A will be the sum of AUTH values of B and C: 1.16.

The new HUB values of both B and C will be equal to 0, because they are not pointing

to any page.

The new AUTH values of A will be 0, because no page is pointing to it.

The new AUTH values of both B and C will be equal to the HUB value of A: 0.58.

BA

C

Figure 8.15 Web graph used in Example 8.7

Get *Building an Intelligent Web: Theory and Practice* now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.