Web Structure Mining
This chapter explains the basic concepts of Web mining and focuses on
Web structure mining. It first introduces bibliometrics as a precursor of
Web structure mining, and then describes methods for computing the value
of academic researchers and of Web pages.
10.1 Web Mining
Data-intensive Web systems typically consist of Web contents (i.e., Web
pages), Web access logs (i.e., user access histories) in a Web server and
databases at its back end (see Fig. 10.1). In other words, such a Web system
as a whole constitutes a Web database system. Among such data, Web
contents are modeled as graph structures where pages and links (hyperlinks)
correspond to nodes and edges, respectively. In a narrower sense, Web
contents may denote only the multimedia data within Web pages, such as
texts and photos, excluding the links. The author takes this position.
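The graph view of Web contents can be made concrete with a small sketch. The page names and links below are hypothetical, and the function name build_web_graph is an assumption introduced only for illustration; the sketch simply stores pages as nodes and hyperlinks as directed edges.

```python
from collections import defaultdict

# Hypothetical hyperlinks, each given as (source page, target page).
links = [
    ("index.html", "about.html"),
    ("index.html", "papers.html"),
    ("about.html", "index.html"),
    ("papers.html", "about.html"),
]

def build_web_graph(edges):
    """Return out-link and in-link adjacency sets for (source, target) links."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in edges:
        out_links[source].add(target)   # edge leaving the source page
        in_links[target].add(source)    # same edge, seen from the target page
    return out_links, in_links

out_links, in_links = build_web_graph(links)
nodes = sorted(set(out_links) | set(in_links))
for page in nodes:
    print(page, "out:", sorted(out_links[page]), "in:", sorted(in_links[page]))
```

Keeping the in-links alongside the out-links is a design choice made here because link analysis methods usually need to ask which pages point to a given page.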
Figure 10.1 An architecture for Web database.
150 Social Big Data Mining
Here, the targets of Web mining are taken to be links, texts in pages, and
access logs; mining of multimedia data and of databases is described
separately. Web mining is therefore roughly classified into the following
three categories according to its main target:
1. Structure mining targets graph structures of Web pages (i.e., link structures).
2. Contents mining targets contents of Web pages (i.e., texts).
3. History mining targets Web access logs.
Note that some research or technologies may not fall strictly into a
single category.
In this section, technologies which find meaningful patterns or
structures by paying attention to the graph structures of Web pages will
be explained in order as follows:
• bibliometrics (an impact factor and h-index)
• Web link analysis (prestige, PageRank, and HITS)
10.2 Structure Mining
Bibliometrics has been a scientific field since before the Web emerged. It
aims at identifying influential writings (especially academic books and
papers) and authors, as well as the relationships between them, through
quantitative analysis of writings and authors. Bibliometrics has so far
produced at least the following concepts and laws.
• The law of Lotka is a statistical law about the writing productivity of an
author.
• The law of Zipf is a statistical law about the contents of writings.
• The number of times that a certain writing is cited by other writings
is deeply related to the influence of the cited writing.
• Co-citation means that two or more writings are simultaneously cited by
another writing (i.e., two or more writings which coincide in citation);
it can be used for measuring similarity between the cited writings in
that case.
• Co-reference means that two or more writings cite another writing
in common; it can be used for measuring similarity between the citing
writings in that case.
• An impact factor is calculated from the number of citations to
writings published in an academic journal and can be used for
measuring the influence of the journal.
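The citation-based concepts above can be sketched on a toy example. The citation data and the function names below are hypothetical; counting shared citations is one simple way to obtain the similarity these concepts refer to, not necessarily the measure used later in the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each citing writing maps to the set of writings it cites.
citations = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Y", "Z"},
}

def citation_counts(cites):
    """Number of times each writing is cited."""
    return Counter(cited for refs in cites.values() for cited in refs)

def co_citation_counts(cites):
    """For each pair of cited writings, the number of writings citing both."""
    counts = Counter()
    for refs in cites.values():
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts

def co_reference_counts(cites):
    """For each pair of citing writings, the number of writings both cite."""
    counts = Counter()
    for a, b in combinations(sorted(cites), 2):
        shared = len(cites[a] & cites[b])
        if shared:
            counts[(a, b)] = shared
    return counts

cited = citation_counts(citations)
cocit = co_citation_counts(citations)
coref = co_reference_counts(citations)
print(dict(cited))
print(dict(cocit))
print(dict(coref))
```

In this toy data, X and Y are cited together by both A and B, so their co-citation count is 2; A and B share the two references X and Y, so their co-reference count is also 2.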
If writings and citations are extended to pages and links, respectively,
the above laws and concepts can be used for analysis of the Web as well as
of writings. First, the laws of Lotka and of Zipf will be explained briefly.
The remaining concepts relevant to citations will be described in detail later.
(1) The law of Lotka
This is a statistical law about the frequency distribution of authors'
productivity. Let P be the number of writings which an author has published
and let A be the frequency of such authors; then the following empirical
rule holds.
$$A \propto \frac{1}{P^{c}}$$
Here, c is a positive number (around 2). The law of Lotka states that
the more writings an author publishes, the lower the frequency of such
authors is (see Fig. 10.2a).
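As a rough numerical illustration, the sketch below assumes the usual inverse-power form of the law, A = K / P^c, with c = 2. The constant K (taken here as the number of authors with exactly one writing) and the function name are assumptions introduced only for this example.

```python
def lotka_expected_authors(authors_with_one, max_writings, c=2.0):
    """Expected number of authors with P = 1..max_writings writings,
    assuming A = K / P**c with K = authors_with_one (hypothetical constant)."""
    return {p: authors_with_one / p ** c for p in range(1, max_writings + 1)}

# Hypothetical population: 100 authors who published exactly one writing.
expected = lotka_expected_authors(authors_with_one=100, max_writings=4)
for p, a in expected.items():
    print(f"P = {p}: about {a:.2f} authors")
```

With these assumed values, the expected number of authors falls from 100 at P = 1 to 25 at P = 2 and 6.25 at P = 4, which is the decreasing trend the law describes.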
Figure 10.2 (a) The law of Lotka, (b) The law of Zipf.
(2) The law of Zipf
This is a statistical law about the frequency distribution of words which
appear in a writing. Let R be the rank of a word used in the writing and let
W be its frequency; then the following empirical rule holds.