“4137X˙CH06˙Akerkar” 2007/9/8 11:13 page 254 #50
254 CHAPTER 6 Web Usage Mining
6.5.4 Sequence-Pattern Analysis of Web Logs
The data in web-access logs are intrinsically sequential. We used data-visualization techniques
to look at aggregate navigation using Pathalizer as well as navigation in individual sessions
using StatViz. Applying data-mining techniques to analyze sequences of web requests is an
important area of research (Cadez et al. 2000; Iváncsy and Vajk 2006). Many of the techniques
involved are at an experimental stage and rely on sophisticated mathematical analysis. In this
section, we will first look at some preliminary analytical techniques and then review some of the
more esoteric ones.
Cadez et al. (2000) presented the msnbc.com (2000) anonymous web data, which can be downloaded
from http://kdd.ics.uci.edu/databases/msnbc/msnbc.html. It is also available on the CD under
Chapter 6 in a folder called msnbc. Copy the folder to your hard drive under the folder for
Chapter 6. The file msnbc990928.seq contains the original data, which comes from the
web-access logs of msnbc.com and the news portions of msn.com for the 24-hour period on
September 28, 1999. There were a total of 989,818 user sessions. The data is anonymized, so we
have no knowledge of the login details of the users. The first 20 lines of the file
msnbc990928.seq are shown in Figure 6.28. (Note that the third line wraps around twice, so it looks like there are
22 lines.) The first seven lines give us information about the data. The third line lists various
news categories on the site. These categories are referred to in the data as numbers.
Table 6.10 shows the categories and their corresponding numeric codes. The data consists of
sequences of numbers starting at line 8. Each sequence in the dataset corresponds to one user
session, and each number in a sequence corresponds to a web request made by that user. We know
only the category of the web page that was requested by the user; we do not know the name of
the actual page. Reporting the categories of the pages, as opposed to the actual pages, in fact
simplifies our job. There are anywhere from 10 to 5,000 pages per category, and it would be
difficult to keep track of each one of these pages. The average length of the sequences is 5.7.
As with any other web-access log, page requests served via a caching mechanism are not
recorded in the data.

% Different categories found in input file:
frontpage news tech local opinion on-air misc weather
msn-news health living business msn-sports sports
summary bbs travel
% Sequences:
1 1
3 2 2 4 2 2 2 3 3
1 1
6 7 7 7 6 6 8 8 8 8
6 9 4 4 4 10 3 10 5 10 4 4 4
1 1 1 11 1 1 1
12 12
1 1
Figure 6.28 First twenty lines of msnbc.com data (Note: The third line wraps around twice)

Category      Number
frontpage     1
news          2
tech          3
local         4
opinion       5
on-air        6
misc          7
weather       8
msn-news      9
health        10
living        11
business      12
msn-sports    13
sports        14
summary       15
bbs           16
travel        17
Table 6.10 Categories and corresponding numbers for msnbc.com data
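The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the book's own tooling. It assumes that comment lines start with "%", that the category names appear before a "% Sequences:" marker (possibly wrapped over several lines), and that every later non-empty line is one session. The function name parse_seq_lines and the SAMPLE text are illustrative; the sample rows are copied from Figure 6.28.

```python
from typing import Iterable, List, Tuple


def parse_seq_lines(lines: Iterable[str]) -> Tuple[List[str], List[List[int]]]:
    """Split .seq-style text into category names and integer sequences.

    Assumed layout (inferred from Figure 6.28, not from a format spec):
    '%' starts a comment line, category names precede '% Sequences:',
    and each remaining non-empty line is one session.
    """
    categories: List[str] = []
    sequences: List[List[int]] = []
    in_sequences = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("%"):
            if line.lower().startswith("% sequences"):
                in_sequences = True
            continue  # comment lines carry no data
        if in_sequences:
            sequences.append([int(token) for token in line.split()])
        else:
            # The category list may wrap across lines, so keep extending it.
            categories.extend(line.split())
    return categories, sequences


# Sample rows copied from Figure 6.28 (only the part shown there).
SAMPLE = """\
% Different categories found in input file:
frontpage news tech local opinion on-air misc weather
msn-news health living business msn-sports sports
summary bbs travel
% Sequences:
1 1
3 2 2 4 2 2 2 3 3
1 1
6 7 7 7 6 6 8 8 8 8
6 9 4 4 4 10 3 10 5 10 4 4 4
1 1 1 11 1 1 1
12 12
1 1
"""

categories, sequences = parse_seq_lines(SAMPLE.splitlines())

# Category codes are 1-based, as in Table 6.10.
code_to_name = {code: name for code, name in enumerate(categories, start=1)}

# Translate the second session into category names.
named = [code_to_name[code] for code in sequences[1]]

# Mean session length of the sample (the full file averages 5.7).
mean_length = sum(len(seq) for seq in sequences) / len(sequences)

print(named)
print(mean_length)
```

To run the same function on the real file, pass an open file object instead of SAMPLE.splitlines(), for example `with open("msnbc990928.seq") as fh: parse_seq_lines(fh)`, assuming the file sits in the current directory.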
One of the most useful pieces of information in the web-access logs is the sequence in which
users access pages. This information can be used to provide appropriate links to simplify the
navigation. One can do a frequency analysis of all the category pairs, such as (1,1), (1,2), (1,3),
..., (17,1), (17,2), ..., (17,17). In total, there are 17 × 17 = 289 pairs. Sequences of two
category numbers (ordered pairs) are important for two reasons:
1. Links are always between a pair of pages. Thus, knowing which pages are requested
   from a given page is the most relevant information for determining the
   navigational links.
2. Pairs have the highest frequencies among sequences of two or more requests. For example,
   a sequence (i, j, k) cannot have a higher frequency than either of the pairs (i, j) or (j, k).
Moreover, the number of possible pairs (289) is much smaller than the number of possible longer
sequences; there are already 17 × 17 × 17 = 4,913 possible triples. We can use pattern-matching
programs such as grep (available for various platforms including UNIX and DOS) to look for
pairs in the sequences.
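The pair-frequency analysis can also be sketched in Python as a stand-in for grep. This is an illustration, not the procedure the text goes on to use. It assumes that "frequency" means the number of contiguous occurrences of a pair across all sessions; the helper name count_ngrams is hypothetical, and the sample sessions are the rows shown in Figure 6.28.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_ngrams(sequences: Iterable[Sequence[int]], n: int) -> Counter:
    """Count every contiguous run of n category codes across all sessions."""
    counts: Counter = Counter()
    for seq in sequences:
        # range() is empty when the session is shorter than n.
        for start in range(len(seq) - n + 1):
            counts[tuple(seq[start:start + n])] += 1
    return counts


# Sample sessions copied from Figure 6.28.
sessions = [
    [1, 1],
    [3, 2, 2, 4, 2, 2, 2, 3, 3],
    [1, 1],
    [6, 7, 7, 7, 6, 6, 8, 8, 8, 8],
    [6, 9, 4, 4, 4, 10, 3, 10, 5, 10, 4, 4, 4],
    [1, 1, 1, 11, 1, 1, 1],
    [12, 12],
    [1, 1],
]

pair_counts = count_ngrams(sessions, 2)
triple_counts = count_ngrams(sessions, 3)

# All ordered pairs over the 17 categories: 17 * 17 = 289.
NUM_CATEGORIES = 17
all_pairs = [
    (i, j)
    for i in range(1, NUM_CATEGORIES + 1)
    for j in range(1, NUM_CATEGORIES + 1)
]

# The most frequent pairs are candidates for navigational links.
top_pairs = pair_counts.most_common(5)

# A triple (i, j, k) can never occur more often than (i, j) or (j, k),
# because each occurrence of the triple contains one of each pair.
bound_holds = all(
    count <= min(pair_counts[(i, j)], pair_counts[(j, k)])
    for (i, j, k), count in triple_counts.items()
)

print(len(all_pairs), top_pairs, bound_holds)
```

Counting occurrences with a Counter keeps only the pairs that actually appear, so pairs missing from the data simply report a count of zero when looked up.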
