Web Access Log Mining,
Information Extraction, and
Deep Web Mining
First, this chapter describes the basic techniques for Web access log mining and their applications, including recommendation, site design improvement, collaborative filtering, and Web personalization. Next, it explains the techniques for extracting information from the generic Web and for mining the deep Web, including social data.
12.1 Web Access Log Mining
12.1.1 Access Log Mining and Recommendation
Web access log mining analyzes the access histories of the users who have visited a website [Liu 2007]. The results of the analysis are mainly used to recommend pages to other users or to re-design the website.
When human users and so-called Web robots access a website, an entry including the IP address, the access time, the requested page, the browser name (i.e., agent), the page visited just before the current page, and search terms is recorded in the Web access log (see Fig. 12.1) [Ishikawa et al. 2003].
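As an illustration, such an entry can be parsed when the log uses the common Apache "combined" format, which records exactly the fields listed above. The regular expression and field names below are a sketch assuming that format, not one prescribed by the text.

```python
import re

# Hypothetical parser for one line of the Apache "combined" log format.
# Captures the fields mentioned in the text: IP, time, requested page,
# referrer (page visited just before), and agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_entry(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_entry(
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0900] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.com/start.html" "Mozilla/5.0"'
)
```

Search terms, when present, are usually recovered by further parsing the query string of the referrer URL.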
After data cleansing, such as the removal of unnecessary Web robots’ histories, is performed on the access log, sessions are extracted from the log. Then user models are created by classifying or clustering the users.
Basically, whether a visitor is a human or a Web robot can be determined from the records by checking whether the visitor accessed the file called robots.txt. This is because Web robots are expected to follow the website’s policy on the acceptance of robots (i.e., the robot exclusion agreement)
described by that file. Moreover, human visitors and Web robots can also be distinguished by checking whether a visitor appears in a robot list created in advance. However, these methods are not effective against malicious Web robots or new Web robots that have not yet been registered. In that case, it is necessary to detect such Web robots by their access patterns, and this task is itself a kind of Web access log mining [Tan et al. 2002]. In any case, assume for simplicity that the Web access log has been cleansed, that is, that Web robot accesses have been removed from the log by some such method.
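A minimal sketch of this cleansing step, assuming the log has already been parsed into dictionaries with illustrative field names (ip, page, agent) and an assumed robot list:

```python
# Sketch of log cleansing: drop entries from clients that requested
# robots.txt or whose agent string appears in a pre-built robot list.
# The field names and list contents are illustrative assumptions.
KNOWN_ROBOT_AGENTS = {"Googlebot", "Bingbot"}  # assumed robot list

def remove_robots(entries):
    # Clients that fetched robots.txt are presumed to be well-behaved robots.
    robot_ips = {e["ip"] for e in entries if e["page"] == "/robots.txt"}
    return [
        e for e in entries
        if e["ip"] not in robot_ips
        and not any(bot in e["agent"] for bot in KNOWN_ROBOT_AGENTS)
    ]

log = [
    {"ip": "203.0.113.5", "page": "/robots.txt", "agent": "Googlebot"},
    {"ip": "203.0.113.5", "page": "/a.html", "agent": "Googlebot"},
    {"ip": "192.0.2.1", "page": "/a.html", "agent": "Mozilla/5.0"},
]
cleaned = remove_robots(log)  # only the human visitor's entry remains
```

As the text notes, this catches only cooperative or already-known robots; pattern-based detection is needed for the rest.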
A sequence of page accesses by the same user is called a session. Basically, whether two accesses come from the same user is judged by the IP address. In general, however, there is no guarantee that the same IP address (for example, a dynamic IP address) represents the same user. Therefore, in order to identify the same user correctly, it may be necessary to combine other information (for example, the agent).
It is usually assumed that the time interval between one access and the next within a session is less than 30 minutes; accesses within this threshold are united into one session. As other methods of extracting sessions from the access log, it is possible to use a threshold on the duration of the whole session, or to add a page access to the session under consideration if that session already contains the access to its preceding page.
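The 30-minute timeout heuristic can be sketched as follows; the user key (here a single identifier standing in for IP plus agent) and the record layout are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Sketch of timeout-based sessionization: consecutive accesses by the
# same user belong to one session while they are < 30 minutes apart.
TIMEOUT = timedelta(minutes=30)

def sessionize(entries):
    """entries: list of (user, timestamp, page), sorted by timestamp."""
    sessions = {}   # user -> list of sessions; each session is a page list
    last_time = {}  # user -> timestamp of that user's previous access
    for user, t, page in entries:
        if user not in last_time or t - last_time[user] >= TIMEOUT:
            sessions.setdefault(user, []).append([])  # start a new session
        sessions[user][-1].append(page)
        last_time[user] = t
    return sessions

t0 = datetime(2023, 10, 10, 13, 0)
log = [
    ("u1", t0, "/a"),
    ("u1", t0 + timedelta(minutes=5), "/b"),
    ("u1", t0 + timedelta(minutes=50), "/c"),  # gap > 30 min: new session
]
print(sessionize(log))  # {'u1': [['/a', '/b'], ['/c']]}
```

The duration-threshold and referrer-based variants mentioned above would change only the condition for starting a new session.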
Next, how to extract access patterns based on transition probabilities will be explained.
The transition probability P from page A to page B (denoted by A → B) can be calculated as follows (see Fig. 12.2).

P(A → B) = {the number of transitions from A to B}/{the total number of transitions from A}

Furthermore, the transition probability P of the path of pages (A → B → C) is calculated as follows.

P(A → B → C) = P(A → B) × P(B → C)
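Both formulas can be computed directly from the extracted sessions, as in the following sketch; the session data are illustrative.

```python
from collections import Counter

# Estimate P(A -> B) by counting transitions in session page sequences,
# then multiply the pairwise probabilities along a path.
def transition_probs(sessions):
    pair_counts, from_counts = Counter(), Counter()
    for session in sessions:
        for a, b in zip(session, session[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    return {pair: n / from_counts[pair[0]] for pair, n in pair_counts.items()}

def path_prob(probs, path):
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= probs.get((a, b), 0.0)
    return p

sessions = [["A", "B", "C"], ["A", "B"], ["A", "C"]]
probs = transition_probs(sessions)
# P(A -> B) = 2/3, P(B -> C) = 1, so P(A -> B -> C) = 2/3
```

Paths whose probability exceeds some threshold can then be reported as frequent navigation patterns, e.g., for recommendation.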
Figure 12.1 An example of web access log data.