Chapter 4. How to discover the characteristics of your customers 59
Replacing missing values with a valid default value is a simple task that can
prevent many mistakes if it is done at the start. When replacing missing values, it is
important to retain the information that the value was missing. For numeric
variables, this is done by creating one or more associated variables that contain
this information (for example, a variable with the value “M” if missing and “P” if
populated). For categorical variables, it is straightforward to replace
the missing values with an indicator value (for example, “missing”). The
importance of missing values is most evident in relation to demographic and
relationship data. When this type of variable has missing values, it is typically an
indication that your relationship with the customer is not very well developed. For
example, if you do not know the address or marital status of a customer, it
typically means that you do not have a close relationship with the customer.
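The replacement scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample data, the default age of 40, and the choice to treat an empty string as missing are all hypothetical and not part of the original method.

```python
def fill_numeric(values, default):
    """Replace missing numeric values and keep a companion flag variable.

    Returns (filled, flags), where flags holds "M" for a value that was
    missing and "P" for a value that was populated.
    """
    filled = [default if v is None else v for v in values]
    flags = ["M" if v is None else "P" for v in values]
    return filled, flags


def fill_categorical(values, indicator="missing"):
    """Replace missing categorical values with an explicit indicator value.

    Assumption: both None and the empty string count as missing.
    """
    return [indicator if v is None or v == "" else v for v in values]


# Hypothetical customer data; None marks a missing value.
ages = [34, None, 51, None]
marital_status = ["married", None, "single", ""]

# The default of 40 is an arbitrary illustrative choice.
age_filled, age_flag = fill_numeric(ages, default=40)
marital_filled = fill_categorical(marital_status)

print(age_filled)      # [34, 40, 51, 40]
print(age_flag)        # ['P', 'M', 'P', 'M']
print(marital_filled)  # ['married', 'missing', 'single', 'missing']
```

Keeping the flag as a separate variable means that the fact that a value was missing remains available to the segmentation, even though the numeric variable itself no longer contains gaps.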
The visual inspection can, if necessary, be supplemented with simple statistical
measures such as minimum, maximum, average, and standard deviation. If errors are
found, they should be resolved by replacing or adjusting the affected values.
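As a minimal sketch of these statistical measures, the following Python uses only the standard library. The function name and the income figures are hypothetical; the data is constructed so that one value looks suspicious.

```python
import statistics


def summarize(values):
    """Return minimum, maximum, mean, and sample standard deviation.

    Missing values (None) are skipped. At least two populated values
    are needed for the standard deviation.
    """
    populated = [v for v in values if v is not None]
    return {
        "min": min(populated),
        "max": max(populated),
        "mean": statistics.mean(populated),
        "stdev": statistics.stdev(populated),
    }


# Hypothetical yearly incomes; 990000 may be a data entry error.
incomes = [32000, 41000, 38500, 990000, 36000]
summary = summarize(incomes)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the maximum lies far from the other values and the standard deviation exceeds the mean, which suggests that the variable deserves a closer look before it is used.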
In our example, we have the exceptional (and unrealistic) case in which there do not
seem to be any problems with missing or outlying values.
Step 2 — Identifying problems with variables
The next step is to remove redundant or duplicated variables. By redundant
variables, we mean variables that are highly correlated with other variables (see
Figure 4-3). When two variables are highly correlated, using both in our segmentation
adds no information and may distort the result. The reason is that
data-driven segmentation works by grouping similar records (customers in this
case). If a number of variables are highly correlated, the segmentation will
tend to use these variables before others, even though no additional
information is gained. Where variables are highly correlated, either the variable
that conveys the greatest business value should be selected and the others
discarded, or statistical techniques such as principal component analysis or
factor analysis should be employed to construct new uncorrelated variables.
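One way to find candidate redundant variables is to compute the Pearson correlation for every pair of variables and list the pairs above a cutoff. The sketch below covers only this detection step, not principal component analysis or factor analysis. The variable names, the sample data, and the cutoff of 0.9 are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (a constant sequence would
    cause a division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.9):
    """List variable pairs whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical customer variables; annual_spend is exactly 12 x monthly_spend.
customers = {
    "monthly_spend": [120, 340, 90, 560, 210],
    "annual_spend": [1440, 4080, 1080, 6720, 2520],
    "age": [25, 61, 38, 33, 47],
}

redundant = correlated_pairs(customers)
print(redundant)
```

In this data, only the pair monthly_spend and annual_spend is reported, so one of the two would be kept, chosen by business value, and the other discarded.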