Moving Forward with the On Demand Real-time Enterprise

Chapter 3. Architectural considerations 93

Workflow

ETL has excellent workflow control capabilities. It is good for scheduling,

dependencies, and error and exception handling. Typically EAI environments

have extensive rules based capabilities, whereas EII typical does not support

any workflow control.

3.3 Other considerations

In this section we look at some of the additional considerations that can impact

data access and data movement, even though they do not directly perform the

functions themselves.

3.3.1 Database

There are many functions performed by the database, other than simply storing

and retrieving data. And the data in the database can be used in many different

ways to satisfy the needs of the enterprise. In this section we consider a few of

those uses.

Operational systems

Most organizations have many independent sources of data that are often

deployed on a number of different hardware and software environments. As a

result, they also very often have different database and content stores

architectures. This has to be considered when making the choice of an

information integration architecture. Most organizations today have a preferred

database technology that impacts their decision when purchasing new

operational business solutions. The major business application vendors today

typically support all the major database technologies.

Data warehousing

The requirements for capabilities in database technology for a data warehousing

environment are quite different than those for an operational environment. A data

warehouse database, for example, needs to be scalable in terms of loading,

support for advanced ad-hoc query, and being able to deliver large result sets.

Whereas an operational environment supports many short running transactions

with very small storage requirements.

IBM DB2 has stepped up to the challenge of providing techniques and

technologies for creating and using large data warehousing environments. Two

significant examples are the multidimensional clustering and range partitioning.

94 Moving Forward with the On Demand Real-time Enterprise

Optimizing access using Multidimensional Clustering

Multidimensional Clustering (MDC) was designed to improve performance and

reduce table maintenance.

Relational database management systems (RDBMS) have traditionally stored

data randomly on storage devices, later relying on row ID (RID) indexes as the

means of locating and retrieving that data. Multidimensional clustering

technology provides another approach, and performance benefits, to storing,

locating, and accessing large volumes of data. That technology is now integrated

into DB2, as well as its concurrency and recovery mechanisms.

MDC functionality is transparent to both the user and the application. The only

requirement is to use an extension to the CREATE TABLE statement in order to

specify the organization of the data. Users can define multidimensional indexes

on a set of attributes of a table, and use standard SQL for data manipulation and

retrieval. For that reason, existing applications with standard SQL constructs can

take advantage of the performance benefits of multidimensional access methods

within your data warehouse.

Multidimensional clustering is based on the definition of one or more orthogonal

clustering attributes (or expressions) on a table. The table is organized physically

by associating records with similar values for the dimension attributes in an

extent (or block). DB2 maintains this physical layout efficiently and provides

methods of processing database operations for significant performance

improvements.

Clustering indexes are used to physically cluster data on insert according to the

order the clustering index. Such organization of data and index improves

performance of range queries and prefetching. A clustering index is a special

type of RID index that contains entries for all existing combinations of the index

key column and is used by the insert process to keep the rows of the table in

physical sequential order by the values of the index columns.

In data warehousing there are methods to control the insert/delete process to

increase the likelihood that there is free space for the clustering ratio to be

maintained. The cluster ratio tends to degrade after the free space is exhausted

because inserts, updates and deletes may cause the data to be placed out of

sequence. Once the clustering ratio drops below a certain threshold, the

optimizer no longer considers the table to be clustered. At that point the table

must be reorganized.

Range Partitioning

DB2 9 has the capability to support all three common methods of database

partitioning. It can simultaneously handle range or table partitioning,

Get Moving Forward with the On Demand Real-time Enterprise now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Moving Forward with the On Demand Real-time Enterprise by Chuck Ballard, Fabio Hasegawa, Gord Owens, Soren Ravn Pedersen, Klaus Subtil

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly