✐

✐

“4137X˙CH04˙Akerkar” — 2007/9/17 — 11:02 — page 162 — #44

✐

✐

✐

✐

✐

✐

162 CHAPTER 4 Classiﬁcation and Association

the rules seem to be a bit more generalized than the decision tree to increase the support

for a rule at the expense of conﬁdence in the rule. Also note that this is probably not the

best classiﬁcation model. Readers are encouraged to change the options from the UserInput

worksheet and analyze the regenerated decision tree. A good project will involve creating a

larger database from the KDD Cup data and creating a more elaborate classiﬁcation model

(see Exercises).

4.4 Association

In data mining, the necessity of deriving associations from data has received considerable

attention. The solution was formulated by Agrawal, Imielinski, and Swami (1993), and is

well known as market-basket analysis. In this approach, we can think of a set of items and a

large collection of transactions, which are subsets (baskets) of these items. The task is to ﬁnd

relationships between several items within these baskets. One can think of this type of data in

terms of a matrix of p rows corresponding to baskets and q columns corresponding to items.

Such a matrix can be very large. The notion of association rules was invented as a means to

eﬃciently ﬁnd simple patterns in such data.

There are many applications of data mining that ﬁt into this context. The typical example

from which the analysis gets its name is the shopping mall. In this situation, the objective

is to analyze customers’ buying habits by ﬁnding associations between the diﬀerent items

that they place in their shopping baskets. The discovery of such association rules can help

the retailer develop marketing strategies by gaining insight into matters such as which items

are most frequently purchased by customers. It also helps in inventory management and sales

promotion strategies.

It is generally accepted that the discovery of association rules is dependent only on the

discovery of frequent sets. The popular algorithms are thus concerned with determining the

set of frequent itemsets in a given set of operation databases. The procedure is basically to

compute the frequency of occurrences of each itemset in the database. As the total number

of itemsets is exponential in terms of the number of items, it is not possible to count the

frequencies of these sets by reading the database in just one pass; moreover a single pass will

need numerous counters. Consequently, having multiple passes for generating all the frequent

itemsets is unavoidable. Therefore, distinct algorithms for the discovery of association rules

point at dropping the number of passes by generating candidate sets, which are probable

frequent sets. The algorithms vary from one another in the method of handling the candidate

sets and the method of reducing the number of database passes. However, there are some new

methods that attempt to detect frequent sets without having to generate candidate sets.

In this perspective, we can see whether it is possible to generate association rules incre-

mentally. The algorithms for discovering frequent sets are not directly suitable in the situation

where the underlying database is incremented sporadically. The objective is to prevent com-

puting the frequent sets again for the incremented set of data. The notion of border sets turns

out to be very important in this perspective, because they can be determined without addi-

tional eﬀort. Additionally, if no border set is eliminated by a database update, it is shown that

the set of frequent sets can be found without having to search through the whole database.

In the real world, the number of frequent sets is enormous; thus the number of association

rules is also too large to be useful. It turns out to be signiﬁcant to generate only those associ-

Get *Building an Intelligent Web: Theory and Practice* now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.