Chapter 4. How to discover the characteristics of your customers 57
Figure 4-1 The current business segmentation
4.4 Evaluating the data
Having created and populated our data models, the fourth stage in our data
mining method is to perform an initial evaluation of the quality of the data
itself before doing the actual data mining. This involves looking for missing
values, outlier values, and signs of correlation. This process will also give you an
initial understanding of your data and will allow you to find an appropriate way of
discretizing your variables.
This process can be done by utilizing the statistical functions in IM for Data and
the IM for Data Attribute Visualizer. Other data analysis applications can also be
used to give an overview of the data. Among these are DB2 OLAP, to identify
missing values present in the data at a high level of aggregation, and products
from Lotus (1-2-3), Microsoft (Excel), BRIO, Business Objects and SPSS.
58 Mining Your Own Business in Banking Using DB2 Intelligent Miner for Data
Step 1 — Looking for missing and outlying values
The univariate statistics for a subset of variables from our data model are
presented in Figure 4-2.
Figure 4-2 Variables for segmentation
The statistics are shown as a series of histograms and pie charts. In this step we
are looking for outliers, missing values and unusual distributions that may
indicate some systematic error in the data sourcing process.
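As a rough illustration of what the histograms reveal, the following sketch bins a numeric variable and prints a text histogram. It is not how IM for Data produces its charts; the function name `text_histogram`, the sample ages and the bin width of 20 are all assumptions made for the example.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count values per fixed-width bin; None is tallied as missing."""
    counts = Counter()
    missing = 0
    for v in values:
        if v is None:
            missing += 1
        else:
            # Lower edge of the bin that contains v.
            counts[(v // bin_width) * bin_width] += 1
    return dict(sorted(counts.items())), missing

# Hypothetical customer ages; None marks a missing value.
ages = [23, 35, 37, 41, 44, 52, 58, 61, None, 189]
bin_width = 20
bins, missing = text_histogram(ages, bin_width)

for lower, n in bins.items():
    print(f"{lower:>3}-{lower + bin_width - 1:<3} {'#' * n}")
print(f"missing {missing}")
```

An isolated bar far from the rest of the distribution, such as the single value in the 180-199 bin here, is the kind of feature that deserves a closer look.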
The most effective way of identifying anomalies in your data is to visually inspect
the distribution of the variables. This will usually uncover obvious errors like
customers that are 189 years old or variables with a large number of missing
values.
Replacing missing values with a default valid value is a simple task that may
avoid a lot of mistakes when done initially. When replacing missing values, it is
important to retain the information that the value was missing. For numeric
variables this is done by creating one or more associate variables that contain
this information (for example, a variable with the value “M” if missing, and “P” if
populated). In the case of categorical variables, it is straightforward to replace
the missing values with an indicator value (for example, “missing”). The
importance of missing values is most evident in relation to demographic and
relationship data. When this type of variable has missing values, it is typically an
indication that your relationship with the customer is not very well developed. For
example, if you do not know the address or marital status of a customer it
typically means that you do not have a close relationship with the customer.
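The replacement scheme described above can be sketched as follows. This is a minimal illustration, not the IM for Data implementation: the helper names `fill_numeric` and `fill_categorical` are hypothetical, the sample data is invented, and using the median as the default value is an assumption (any valid default would do).

```python
from statistics import median

def fill_numeric(values, default=None):
    """Replace missing (None) numeric values and return an indicator column.

    The indicator holds "M" where the value was missing and "P" where it
    was populated, so the fact that a value was missing is retained.
    """
    if default is None:
        # Assumed choice: use the median of the populated values.
        default = median(v for v in values if v is not None)
    filled = [default if v is None else v for v in values]
    indicator = ["M" if v is None else "P" for v in values]
    return filled, indicator

def fill_categorical(values, marker="missing"):
    """Replace missing (None) categorical values with an indicator value."""
    return [marker if v is None else v for v in values]

# Hypothetical customer data; None marks a missing value.
income = [42000, None, 58000, 31000, None]
marital_status = ["married", None, "single", "single", None]

income_filled, income_flag = fill_numeric(income)
marital_filled = fill_categorical(marital_status)

print(income_filled)
print(income_flag)
print(marital_filled)
```

Keeping the indicator as a separate variable means it can itself be used in the segmentation, for example as a rough signal of how well developed the customer relationship is.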
The visual inspection can, if necessary, be supplemented with simple statistical
measures like minimum, maximum, average and standard deviation. If errors are
found, these should be resolved by replacing or adjusting the affected values.
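A minimal sketch of these measures, using only the Python standard library, is shown below. The function names `summarize` and `out_of_range`, the sample ages, and the plausible age range of 18 to 110 are assumptions chosen for illustration; in practice the limits come from business knowledge of each variable.

```python
from statistics import mean, stdev

def summarize(values):
    """Basic univariate statistics, ignoring missing (None) values."""
    populated = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(populated),
        "min": min(populated),
        "max": max(populated),
        "mean": mean(populated),
        "stdev": stdev(populated),  # sample standard deviation
    }

def out_of_range(values, low, high):
    """Return populated values that fall outside the plausible range."""
    return [v for v in values if v is not None and not (low <= v <= high)]

# Hypothetical customer ages; None marks a missing value.
ages = [23, 35, 37, 41, 44, 52, 58, 61, None, 189]

stats = summarize(ages)
suspect = out_of_range(ages, low=18, high=110)  # assumed plausible limits

for name, value in stats.items():
    print(f"{name:>8}: {value}")
print("suspect values:", suspect)
```

A maximum far above any plausible value, or a standard deviation that is large relative to the mean, is a prompt to go back to the distribution and look for the records responsible.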
In our example, we have an exceptional (and unrealistic) case in which there do
not seem to be any problems with missing or outlying values.
Step 2 — Identifying problems with variables
The next step is to remove redundant or duplicated variables. By redundant
variables, we mean variables that are highly correlated with other variables (see
Figure 4-3). If this is the case, using both variables in our segmentation will not
add information and may distort the result. The reason for this is that
data-driven segmentation works by grouping similar records (customers in this
case). If a number of variables are highly correlated, then the segmentation will
tend to use these variables before others, even though no additional
information is gained. Where variables are highly correlated, either the variable
that conveys the greatest business value should be selected and the others
discarded, or statistical techniques such as principal component analysis or
factor analysis should be employed to construct new uncorrelated variables.
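One simple way to find candidate redundant variables is to compute the Pearson correlation coefficient for every pair of numeric variables and list the pairs above a cutoff. The sketch below does this with the standard library only. The variable names, the sample data and the cutoff of 0.9 are assumptions for illustration; the appropriate threshold is a judgment call, and the sketch does not perform principal component or factor analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| at or above threshold."""
    pairs = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical customer variables (one list entry per customer).
customers = {
    "avg_balance": [1200, 300, 5600, 800, 2500, 150],
    "total_deposits": [2500, 700, 11000, 1700, 5200, 400],
    "age": [34, 51, 45, 23, 62, 38],
}

pairs = correlated_pairs(customers, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} and {b} are highly correlated (r = {r:.3f})")
```

For each pair reported, the choice of which variable to keep remains a business decision, as described above.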