36 Mining Your Own Business in Telecoms Using DB2 Intelligent Miner for Data
Use of common data models
Defining data models for any application is often a complex task and defining
data models for data mining is no exception. Where the data model is required to
support an application that has specific requirements (for example, some form of
business reporting tool) then the data can be defined by asking the end users
what types of information they require and then performing the necessary
aggregations to support this requirement. In the case of data mining, the
challenge is that very often you are not sure at the outset which variables are
important and therefore exactly what is required. Generating data models for
completely new data mining applications can consequently be a time-consuming
activity.
The alternative is to use common data models that have been developed to solve
similar business issues to the ones you are trying to address. While these types
of models may not initially provide you with all of the information you require, they
are usually designed to be extendable to include additional variables. The main
advantage of using a common data model is that it will provide you with a way of
quickly seeing how data mining can be used within your business. In the following
chapters we suggest some simple data models that can be used in this way.
3.4.3 Step 3 Sourcing and preprocessing the data
The third step in the generic data mining method is the sourcing and
preprocessing of the data that populates the data model. Having a defined data
model provides the necessary structure, in terms of the variables that we are
going to mine, but we still have to provide the data.
Data sourcing and preprocessing comprises the stages of identifying, collecting,
filtering, and aggregating (raw) data into the format required by the data models
and the selected mining function. Since sourcing and preparing the data are the
most time-consuming parts of any data mining project, we describe these crucial
steps in more detail. Where the data is derived from a data warehouse, many
of these stages will already have been performed.
The data sources
Data sources can differ in origin and content, as shown in Figure 3-7.
Chapter 3. A generic data mining method 37
Figure 3-7 Data sources by origin and content
Every business uses standard internal data sources, and many of them are similar
in content. A customer database or a product database, for example, can be found
in nearly any data scenario.
Data mining, in common with many other analysis tools, usually requires the data
to be in one consolidated table or file. If the variables required are distributed
across a number of sources then this consolidation must be performed such that
a consistent set of data records is produced. If the data is stored in relational
database tables then the creation of a new table or a database view is relatively
straightforward, although where complex aggregations are required this can
have a significant impact on database resources.
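The consolidation described above can be sketched with a database view. The following is a minimal illustration only, using Python's built-in sqlite3 module rather than DB2; the table names (customer, call_record), the columns, and the sample rows are hypothetical and do not come from this book.

```python
import sqlite3

# Hypothetical source tables; names, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, tariff TEXT);
    CREATE TABLE call_record (customer_id INTEGER, minutes REAL);

    INSERT INTO customer VALUES (1, 'basic'), (2, 'premium'), (3, 'basic');
    INSERT INTO call_record VALUES (1, 10.0), (1, 5.5), (2, 30.0);

    -- One row per customer: the consolidated record set to be mined.
    CREATE VIEW mining_input AS
    SELECT c.customer_id,
           c.tariff,
           COUNT(r.customer_id)          AS n_calls,
           COALESCE(SUM(r.minutes), 0.0) AS total_minutes
    FROM customer AS c
    LEFT JOIN call_record AS r ON r.customer_id = c.customer_id
    GROUP BY c.customer_id, c.tariff;
""")

# Read the consolidated records back, one tuple per customer.
rows = conn.execute(
    "SELECT * FROM mining_input ORDER BY customer_id"
).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps customers that have no call records, so the view still yields exactly one record per customer. Because a view is re-evaluated on each query, a heavy aggregation of this kind may be better materialized as a table when database resources are a concern.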
Data preprocessing
If the data has not been derived from a data warehouse then the data
preprocessing functions of cleansing, aggregating, transforming, and filtering,
which we described in 2.2, "Data warehouse" on page 8, must be undertaken.
Even when the data is derived from a data warehouse, there may still be some
additional transformations of the data that need to be performed before mining
can proceed. The structure of data preprocessing is shown in
Figure 3-8.
Figure 3-8 Data preprocessing
Data mining tools usually provide limited capability to cleanse the data, because
this is a specialized process and there are a number of products that can be used
to do this efficiently. Aggregation and filtering can be performed in a number of
different ways depending on the precise structure of your data sources. Some of
the tools available to do this with the IM for Data product are described in 8.2.1,
"Data preparation functions" on page 147.
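To make the preprocessing functions concrete, the following is a minimal sketch in plain Python. It does not use the IM for Data preparation functions; the record layout, the field names, and the cleansing rule (drop missing or negative durations) are assumptions chosen for illustration, and the order of the steps is one reasonable choice rather than a prescribed sequence.

```python
from collections import defaultdict

# Hypothetical raw call records; field names and values are illustrative only.
raw_records = [
    {"customer_id": "1", "minutes": "10.0", "call_type": "voice"},
    {"customer_id": "1", "minutes": "5.5", "call_type": "VOICE "},
    {"customer_id": "2", "minutes": "", "call_type": "voice"},      # missing value
    {"customer_id": "2", "minutes": "30.0", "call_type": "data"},
    {"customer_id": "3", "minutes": "-4.0", "call_type": "voice"},  # invalid value
]


def cleanse(records):
    """Drop records whose duration is missing, non-numeric, or negative."""
    for rec in records:
        try:
            minutes = float(rec["minutes"])
        except ValueError:
            continue
        if minutes >= 0:
            yield {**rec, "minutes": minutes}


def transform(records):
    """Convert the customer key to an integer and normalize the call type."""
    for rec in records:
        yield {
            **rec,
            "customer_id": int(rec["customer_id"]),
            "call_type": rec["call_type"].strip().lower(),
        }


def keep_voice(records):
    """Filter: keep only voice calls (an assumed mining scope)."""
    return (rec for rec in records if rec["call_type"] == "voice")


def aggregate(records):
    """Aggregate to one summary record per customer."""
    totals = defaultdict(lambda: {"n_calls": 0, "total_minutes": 0.0})
    for rec in records:
        entry = totals[rec["customer_id"]]
        entry["n_calls"] += 1
        entry["total_minutes"] += rec["minutes"]
    return dict(totals)


mining_input = aggregate(keep_voice(transform(cleanse(raw_records))))
print(mining_input)
```

Each step is a small function over a stream of records, so steps that a data warehouse has already performed can simply be left out of the chain.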
[Figure 3-8 labels: cleanse, view, data mart, filter, aggregate, select, data sources]