Chapter 3. Architectural considerations 93
Workflow
ETL has excellent workflow control capabilities. It is good for scheduling,
dependencies, and error and exception handling. Typically EAI environments
have extensive rules based capabilities, whereas EII typical does not support
any workflow control.
3.3 Other considerations
In this section we look at some of the additional considerations that can impact
data access and data movement, even though they do not directly perform the
functions themselves.
3.3.1 Database
There are many functions performed by the database, other than simply storing
and retrieving data. And the data in the database can be used in many different
ways to satisfy the needs of the enterprise. In this section we consider a few of
those uses.
Operational systems
Most organizations have many independent sources of data that are often
deployed on a number of different hardware and software environments. As a
result, they also very often have different database and content stores
architectures. This has to be considered when making the choice of an
information integration architecture. Most organizations today have a preferred
database technology that impacts their decision when purchasing new
operational business solutions. The major business application vendors today
typically support all the major database technologies.
Data warehousing
The requirements for capabilities in database technology for a data warehousing
environment are quite different than those for an operational environment. A data
warehouse database, for example, needs to be scalable in terms of loading,
support for advanced ad-hoc query, and being able to deliver large result sets.
Whereas an operational environment supports many short running transactions
with very small storage requirements.
IBM DB2 has stepped up to the challenge of providing techniques and
technologies for creating and using large data warehousing environments. Two
significant examples are the multidimensional clustering and range partitioning.
94 Moving Forward with the On Demand Real-time Enterprise
Optimizing access using Multidimensional Clustering
Multidimensional Clustering (MDC) was designed to improve performance and
reduce table maintenance.
Relational database management systems (RDBMS) have traditionally stored
data randomly on storage devices, later relying on row ID (RID) indexes as the
means of locating and retrieving that data. Multidimensional clustering
technology provides another approach, and performance benefits, to storing,
locating, and accessing large volumes of data. That technology is now integrated
into DB2, as well as its concurrency and recovery mechanisms.
MDC functionality is transparent to both the user and the application. The only
requirement is to use an extension to the CREATE TABLE statement in order to
specify the organization of the data. Users can define multidimensional indexes
on a set of attributes of a table, and use standard SQL for data manipulation and
retrieval. For that reason, existing applications with standard SQL constructs can
take advantage of the performance benefits of multidimensional access methods
within your data warehouse.
Multidimensional clustering is based on the definition of one or more orthogonal
clustering attributes (or expressions) on a table. The table is organized physically
by associating records with similar values for the dimension attributes in an
extent (or block). DB2 maintains this physical layout efficiently and provides
methods of processing database operations for significant performance
improvements.
Clustering indexes are used to physically cluster data on insert according to the
order the clustering index. Such organization of data and index improves
performance of range queries and prefetching. A clustering index is a special
type of RID index that contains entries for all existing combinations of the index
key column and is used by the insert process to keep the rows of the table in
physical sequential order by the values of the index columns.
In data warehousing there are methods to control the insert/delete process to
increase the likelihood that there is free space for the clustering ratio to be
maintained. The cluster ratio tends to degrade after the free space is exhausted
because inserts, updates and deletes may cause the data to be placed out of
sequence. Once the clustering ratio drops below a certain threshold, the
optimizer no longer considers the table to be clustered. At that point the table
must be reorganized.
Range Partitioning
DB2 9 has the capability to support all three common methods of database
partitioning. It can simultaneously handle range or table partitioning,
Get Moving Forward with the On Demand Real-time Enterprise now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.