When working with data mining, it is useful to understand mining algorithm basics and when to apply each algorithm. Table 57.2 summarizes common algorithms used for the problem categories presented in this chapter's introduction.

Table 57.2 Common Mining Algorithm Usage

Problem Type        Primary Algorithms
Segmentation        Clustering, Sequence Clustering
Classification      Decision Trees, Naive Bayes, Neural Network, Logistic Regression
Association         Association Rules, Decision Trees
Estimation          Decision Trees, Linear Regression, Logistic Regression, Neural Network
Forecasting         Time Series
Sequence Analysis   Sequence Clustering

These are guidelines only, because not every data mining problem falls into one of these categories. In addition, other algorithms may also apply to the listed problem types.

Decision Trees

For many problems, the decision trees algorithm is the most accurate. It builds a decision tree beginning with the All node, which corresponds to all the training cases, as shown in Figure 57.3. An attribute is then chosen to split those cases into groups, each group is split again on another attribute, and so on. The goal is to generate leaf nodes that each have a single predictable outcome. For example, if the goal is to identify who will purchase a bike, then each leaf node should contain cases that are either all bike buyers or all non-buyers, not a mix of the two (or as close to that goal as possible).

Figure 57.3 This is a great example of the decision tree ...
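
The splitting process described above can be sketched in a few lines of Python. This is an illustration only, not the way Analysis Services implements the algorithm. It assumes entropy as the measure of how mixed a group is, and the training cases, attribute names (commute, cars), and function names are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of outcomes (zero means a pure group)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split(cases, attribute):
    """Group the cases by their value for one attribute."""
    groups = {}
    for case in cases:
        groups.setdefault(case[attribute], []).append(case)
    return groups


def best_attribute(cases, attributes, target):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def weighted_entropy(attribute):
        groups = split(cases, attribute).values()
        return sum(len(g) / len(cases) * entropy([c[target] for c in g])
                   for g in groups)
    return min(attributes, key=weighted_entropy)


def build_tree(cases, attributes, target):
    """Split recursively until a node is pure or no attributes remain."""
    labels = [c[target] for c in cases]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority outcome
    attribute = best_attribute(cases, attributes, target)
    remaining = [a for a in attributes if a != attribute]
    return {attribute: {value: build_tree(group, remaining, target)
                        for value, group in split(cases, attribute).items()}}


# Hypothetical training cases; the full list plays the role of the All node.
cases = [
    {"commute": "short", "cars": 0, "buyer": True},
    {"commute": "short", "cars": 1, "buyer": True},
    {"commute": "long", "cars": 0, "buyer": True},
    {"commute": "long", "cars": 2, "buyer": False},
    {"commute": "long", "cars": 1, "buyer": False},
]
tree = build_tree(cases, ["commute", "cars"], "buyer")
```

With these made-up cases, the first split is on cars, because it leaves less mixed groups than commute. The one mixed group (one car) is then split on commute, so every leaf ends up containing only buyers or only non-buyers.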
