Chapter 6: Preparing to Model the Data

  1. 6.1 Supervised Versus Unsupervised Methods
  2. 6.2 Statistical Methodology and Data Mining Methodology
  3. 6.3 Cross-Validation
  4. 6.4 Overfitting
  5. 6.5 Bias–Variance Trade-Off
  6. 6.6 Balancing the Training Data Set
  7. 6.7 Establishing Baseline Performance
    1. The R Zone
    2. Reference
    3. Exercises

6.1 Supervised Versus Unsupervised Methods

Data mining methods may be categorized as either supervised or unsupervised. In unsupervised methods, no target variable is identified. Instead, the data mining algorithm searches for patterns and structure among all the variables. The most common unsupervised data mining method is clustering, our topic for Chapters 10 and 11. For example, political consultants may apply clustering methods to congressional districts to uncover the locations of voter clusters that may be responsive to a particular candidate's message. In this case, all appropriate variables (e.g., income, race, gender) would be input to the clustering algorithm, with no target variable specified, in order to develop accurate voter profiles for fund-raising and advertising purposes.
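
To make the "no target variable" idea concrete, here is a minimal sketch of clustering in plain Python. It uses k-means, which is only one of many clustering algorithms and is chosen here for brevity. The district records (median income in thousands of dollars, median age), the function name `kmeans`, and its parameters are all hypothetical and are not taken from the book.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points (tuples of floats) into k clusters.

    Returns (centroids, labels). No target variable is supplied:
    the grouping is driven only by distances between the inputs.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical districts: (median income in $1000s, median age).
# Real data would usually be standardized first so that no single
# variable dominates the distance calculation.
districts = [(32, 29), (35, 31), (30, 27), (88, 52), (92, 55), (85, 50)]

centroids, labels = kmeans(districts, k=2)
for district, label in zip(districts, labels):
    print(district, "-> cluster", label)
```

The algorithm never sees a column marked as the outcome. It only groups districts that look alike, and interpreting what each cluster means is left to the analyst.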

Another data mining method, which may be supervised or unsupervised, is association rule mining. In market basket analysis, for example, one may simply be interested in “which items are purchased together,” in which case no target variable would be identified. The problem here, of course, is that there are so many items for sale that searching for all possible associations ...
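
The question of “which items are purchased together” can be sketched, in its simplest unsupervised form, as counting how often each pair of items appears in the same basket. The baskets below are invented for illustration, and the 1,000-item store is likewise an assumed figure used only to show how quickly the number of candidate pairs grows.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# Count every unordered pair of items that shares a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets that contain both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}

for pair, n in pair_counts.most_common(3):
    print(pair, "count:", n, "support:", support[pair])

# Number of candidate pairs in an assumed store with 1,000 items.
print("candidate pairs among 1,000 items:", comb(1000, 2))
```

Exhaustive counting works for five baskets, but the number of candidate pairs alone grows quadratically with the number of items, and larger item sets grow faster still.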
