“4137X˙CH04˙Akerkar” 2007/9/17 11:02 page 162 #44
162 CHAPTER 4 Classication and Association
the rules seem to be a bit more generalized than the decision tree to increase the support
for a rule at the expense of confidence in the rule. Also note that this is probably not the
best classification model. Readers are encouraged to change the options from the UserInput
worksheet and analyze the regenerated decision tree. A good project will involve creating a
larger database from the KDD Cup data and creating a more elaborate classification model
(see Exercises).
4.4 Association
In data mining, the necessity of deriving associations from data has received considerable
attention. The solution was formulated by Agrawal, Imielinski, and Swami (1993), and is
well known as market-basket analysis. In this approach, we can think of a set of items and a
large collection of transactions, which are subsets (baskets) of these items. The task is to find
relationships between several items within these baskets. One can think of this type of data in
terms of a matrix of p rows corresponding to baskets and q columns corresponding to items.
Such a matrix can be very large. The notion of association rules was invented as a means to
efficiently find simple patterns in such data.
There are many applications of data mining that fit into this context. The typical example
from which the analysis gets its name is the shopping mall. In this situation, the objective
is to analyze customers’ buying habits by finding associations between the different items
that they place in their shopping baskets. The discovery of such association rules can help
the retailer develop marketing strategies by gaining insight into matters such as which items
are most frequently purchased by customers. It also helps in inventory management and sales
promotion strategies.
It is generally accepted that the discovery of association rules is dependent only on the
discovery of frequent sets. The popular algorithms are thus concerned with determining the
set of frequent itemsets in a given set of operation databases. The procedure is basically to
compute the frequency of occurrences of each itemset in the database. As the total number
of itemsets is exponential in terms of the number of items, it is not possible to count the
frequencies of these sets by reading the database in just one pass; moreover a single pass will
need numerous counters. Consequently, having multiple passes for generating all the frequent
itemsets is unavoidable. Therefore, distinct algorithms for the discovery of association rules
point at dropping the number of passes by generating candidate sets, which are probable
frequent sets. The algorithms vary from one another in the method of handling the candidate
sets and the method of reducing the number of database passes. However, there are some new
methods that attempt to detect frequent sets without having to generate candidate sets.
In this perspective, we can see whether it is possible to generate association rules incre-
mentally. The algorithms for discovering frequent sets are not directly suitable in the situation
where the underlying database is incremented sporadically. The objective is to prevent com-
puting the frequent sets again for the incremented set of data. The notion of border sets turns
out to be very important in this perspective, because they can be determined without addi-
tional effort. Additionally, if no border set is eliminated by a database update, it is shown that
the set of frequent sets can be found without having to search through the whole database.
In the real world, the number of frequent sets is enormous; thus the number of association
rules is also too large to be useful. It turns out to be significant to generate only those associ-

Get Building an Intelligent Web: Theory and Practice now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.