1.3 RAMS
RAMS was an abbreviation used by my team when we delivered the very
first Exchange 2000 Academy program in September 1999. After spending
several months working with the Microsoft engineering team and develop-
ing the material, Donald Livengood of Hewlett Packard (HP) came up with
the RAMS acronym, which is derived from reliability, availability, manage-
ability, and scalability. These four key features are well implemented and
represented by many key functions of the product.
In fact, since the release of Microsoft Exchange 2000 in late 2000, and
later with Exchange 2003, many deployments managed to benefit from the
RAMS features of the product, described further in this section. More
importantly, Exchange 2003 builds on features of the Windows Server 2003
environment, and each service pack or major release improves on
reliability, availability, manageability, and scalability.
Has the journey ended yet? Definitely not: Microsoft has to deal with a
legacy of Microsoft Exchange rollouts, and future versions of the product
will bring improvements in each of these areas. Some require fundamental
changes at the operating system level, application level, or hardware
component level. As each matures, the end solution and user experience
improve.
1.3.1 Reliability
The goal of reliability in Microsoft Exchange 2003 is to perform service
functions under stated conditions within a given time period. Microsoft
Exchange has often suffered from database corruption errors that could be
caused by faulty hardware or software components, imposing on the
administrators a long and painful recovery process involving part or all of
the server and, always, the entire corrupted database. When Exchange 5.5
introduced unlimited storage, it also left the door open to unlimited prob-
lems: restoring a 16-GB database file takes much less time than restoring a
250-GB database file—time during which the users do not have access to
their mail service. Some deployments today have to deal (suffer?) with
Information Stores larger than 1.5 TB, which cannot be repaired rapidly
and for which backup and recovery are painful.
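To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The restore throughput of 30 MB/s is purely an assumption chosen for illustration, not a measured figure, and the function name is hypothetical; substitute the sustained rate of your own backup infrastructure.

```python
def restore_seconds(size_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate how long it takes to stream a database file back from backup."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1024 / throughput_mb_per_s


# Assumption for illustration only: sustained restore rate of the backup system.
ASSUMED_THROUGHPUT_MB_PER_S = 30.0

if __name__ == "__main__":
    for label, size_gb in [("16 GB", 16), ("250 GB", 250), ("1.5 TB", 1.5 * 1024)]:
        hours = restore_seconds(size_gb, ASSUMED_THROUGHPUT_MB_PER_S) / 3600
        print(f"{label:>7}: roughly {hours:.1f} hours without mail service")
```

Whatever rate you plug in, restore time grows linearly with database size, and this estimate does not yet include the time needed to play back transaction logs.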
To the extent possible, Microsoft improved the core database engine uti-
lized in Microsoft Exchange—Extensible Storage Engine (ESE)—to pre-
vent any malformed database pages from being stored on disk. Exchange
4.0 introduced the notion of database pages whose content could be vali-
dated by a simple checksum calculated over the 4-KB block making up a
database page. If the checksum stored with the page differed from the
checksum calculated after reading the page, the database was considered
corrupted and flagged as bad. With Exchange 2003 SP2, the checksum can
also recover information: a single-bit flip can be corrected by using an
error-correcting checksum algorithm (instead of a simple error-detection
algorithm).
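The difference between detecting and correcting an error can be illustrated with a small sketch. This is not the algorithm ESE uses; it is a textbook Hamming-style scheme, and every name in it is hypothetical. It stores two values per 4-KB page: the XOR of the positions of all set bits (which locates a single flipped bit) and an overall parity bit (which tells a single flip apart from a double flip).

```python
PAGE_SIZE = 4096  # bytes per database page, as described in the text


def compute_ecc(page: bytes) -> tuple[int, int]:
    """Return (locator, parity) for a page.

    locator: XOR of the 1-based positions of every set bit.
    parity:  1 if the number of set bits is odd, else 0.
    """
    locator = 0
    parity = 0
    for byte_index, byte in enumerate(page):
        for bit in range(8):
            if (byte >> bit) & 1:
                locator ^= byte_index * 8 + bit + 1
                parity ^= 1
    return locator, parity


def check_page(page: bytes, stored: tuple[int, int]) -> tuple[str, bytes]:
    """Verify a page against its stored ECC and repair a single-bit flip."""
    locator, parity = compute_ecc(page)
    if (locator, parity) == stored:
        return "ok", page
    syndrome = locator ^ stored[0]
    # A single flipped bit changes the parity, and the syndrome is its position.
    if parity != stored[1] and 1 <= syndrome <= len(page) * 8:
        position = syndrome - 1
        repaired = bytearray(page)
        repaired[position // 8] ^= 1 << (position % 8)
        return "repaired", bytes(repaired)
    # Anything else (for example, two flipped bits) is detected but not fixable.
    return "corrupt", page


if __name__ == "__main__":
    original = bytes(range(256)) * 16           # a 4-KB test page
    stored = compute_ecc(original)
    damaged = bytearray(original)
    damaged[1000] ^= 0x10                       # simulate a single-bit flip
    status, page = check_page(bytes(damaged), stored)
    print(status, page == original)
```

Like any single-error-correcting code, this sketch can be fooled by three or more simultaneous bit flips, which is why a plain detection checksum was only able to flag the page as bad while a correcting code can repair the most common case.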
In a situation with page-level corruption, you have two choices:
1. Run the ESEUTIL tool to remove the invalid pages;
2. Restore the last known good database from backup and play back
the intermediate transactions stored in the transaction log files.
In fact, neither of these two solutions is very satisfactory, especially the
first, since it could lead to irreversible loss of data—which is impermissible
in modern infrastructures.
Microsoft worked hard to reduce the likelihood of software-based
corruption. Today, virtually all page corruptions are due to faulty
hardware: the component at fault could be a disk, a controller, or an
interconnect element, such as a host-bus adapter or Fibre Channel link
(just like software, hardware and firmware have bugs, too!). Some of these
components have their own built-in recovery mechanisms, and with Exchange
2003 SP2, very few page-level corruptions occur; this is one thing the
Microsoft Exchange administrator no longer has to worry about!
In conjunction, hardware manufacturers have vastly improved the
reliability of storage infrastructures, especially when these are put
under stress load or abnormal activities (for example, a RAID5 volume
rebuild). They didn't really wait for Microsoft to do this, but as RAID
and multidisk volumes became more widely used, a significant effort and
investment was put forth to ensure that volume protection was actually
efficient. In addition, the notion of checksum has been extended to the
transaction log records (preventing you from playing back a corrupted
transaction into the database) and, with SP1, to the streaming store.
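To show what checksummed log records buy you during recovery, here is a minimal sketch of a replay loop. The record layout, the use of CRC-32, and all names are assumptions made for illustration; they do not reflect the actual ESE log format. The point is only that each record is verified before it is applied, and replay stops at the first record that fails.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """A hypothetical transaction log record: one key/value update plus a checksum."""
    key: str
    value: str
    checksum: int


def record_checksum(key: str, value: str) -> int:
    """CRC-32 over the record payload (an illustrative choice of checksum)."""
    return zlib.crc32(f"{key}={value}".encode("utf-8"))


def make_record(key: str, value: str) -> LogRecord:
    return LogRecord(key, value, record_checksum(key, value))


def replay(records: list[LogRecord], database: dict[str, str]) -> int:
    """Apply records in order; stop at the first one whose checksum does not match.

    Returns the number of records applied. Records after the bad one are not
    replayed, because later transactions may depend on the missing one.
    """
    applied = 0
    for record in records:
        if record_checksum(record.key, record.value) != record.checksum:
            break
        database[record.key] = record.value
        applied += 1
    return applied


if __name__ == "__main__":
    good = make_record("mailbox-1", "message A")
    # Simulate on-disk damage: the payload changed but the stored checksum did not.
    bad = LogRecord("mailbox-2", "messagX B", record_checksum("mailbox-2", "message B"))
    later = make_record("mailbox-3", "message C")

    db: dict[str, str] = {}
    count = replay([good, bad, later], db)
    print(f"applied {count} of 3 records; database now holds {sorted(db)}")
```

The checksum keeps the damaged transaction out of the database, but it cannot bring back the updates that were never replayed.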
Unfortunately, we are still lacking the tools needed to recover data from
corrupted transaction log files. This can be an issue because if for some rea-
son your database needs to be recovered and transactions played back, and
if the transaction logs are corrupted halfway through the replay, you essen-
tially have lost data. I will remind you several times in this book: protecting
