6.1 INTRODUCTION

6.1.1 Overview

Dividing a data set into smaller subsets of related observations or groups is important for exploratory data analysis and data mining for a number of reasons:

  • Finding hidden relationships: Grouping methods organize observations in different ways. Looking at the data from these different angles allows us to find relationships that are not obvious from a summary alone. For example, when a data set of retail transactions is grouped, the groups can be used to find nontrivial associations, such as the tendency of customers who purchase doormats to purchase umbrellas at the same time.
  • Becoming familiar with the data: Before using a data set to create a predictive model, it is beneficial to become highly familiar with the contents of the set. Grouping methods allow us to discover which types of observations are present in the data. For example, suppose a database of medical records will be used to create a general model for predicting a number of medical conditions. Before creating the model, the data set is characterized by grouping the observations. This reveals that a significant portion of the data consists of young female patients with the flu. The data set therefore does not appear to be evenly stratified across the model's target population, which includes both male and female patients with a variety of conditions. It may be necessary to create from these observations a more diverse subset that matches the target population more closely.
  • Segmentation: Techniques ...
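
The medical-records check described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field layout (age band, sex, condition), and the 40% threshold are invented for the example, and exact-match counting on categorical fields stands in for the grouping methods discussed in this chapter.

```python
from collections import Counter

# Hypothetical records: (age_band, sex, condition). Invented for illustration only.
records = [
    ("18-30", "F", "flu"),
    ("18-30", "F", "flu"),
    ("18-30", "F", "flu"),
    ("18-30", "F", "asthma"),
    ("31-50", "M", "diabetes"),
    ("51-70", "F", "hypertension"),
    ("18-30", "M", "flu"),
    ("18-30", "F", "flu"),
]


def group_shares(rows):
    """Return each distinct group's share of the rows, largest group first."""
    counts = Counter(rows)
    total = len(rows)
    return [(group, n / total) for group, n in counts.most_common()]


shares = group_shares(records)
for group, share in shares:
    print(group, f"{share:.0%}")

# Flag any single group whose share exceeds an arbitrary, illustrative threshold.
DOMINANCE_THRESHOLD = 0.4
dominant = [group for group, share in shares if share > DOMINANCE_THRESHOLD]
print("Dominant groups:", dominant)
```

In this toy data set, young female patients with flu make up half of the records, which is the kind of imbalance that would prompt building a more representative subset before modeling.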
