164 WebSphere eXtreme Scale Best Practices for Operation and Management
com.ibm.CORBA.FragmentTimeout=5
com.ibm.CORBA.LocateRequestTimeout=5
Setting core group heartbeat intervals
WebSphere eXtreme Scale relies on a heartbeat mechanism to detect whether containers
are up and running. The actual implementation is based on the core groups that are used by
WebSphere’s High Availability Manager.
Failures caused by process crashes (soft failures) are typically detected in less than one
second. The network sockets are automatically closed by the operating system hosting the
process when a soft failure occurs. A hard failure is a physical computer or server crash,
a network cable disconnection, or an operating system error. In this case, it can take much
longer for WebSphere eXtreme Scale to detect that a container has failed. To ensure that hard
failures are detected within a reasonable amount of time, configure the heartbeat mechanism.
Configuring failover for stand-alone environments
You can configure heartbeat intervals on the command line by using the -heartbeat
parameter of the startOgServer.sh command-line script. Table 7-1 shows the available
options.
Table 7-1 Heartbeat interval options for stand-alone environments
Configuring failover for WebSphere Application Server V6 environments
You can configure WebSphere Application Server Network Deployment 6.0.2 and 6.1 to
control the WebSphere eXtreme Scale failover behavior. The default failover time for hard
failures is approximately 200 seconds, which is relatively long.
WebSphere eXtreme Scale running in a WebSphere Application Server process inherits the
failover characteristics from the core group settings of the application server. Table 7-2 on
page 165 lists the custom properties that are used to configure the core group heartbeat
settings for various versions of WebSphere Application Server Network Deployment V6 and
V6.1. Set these custom properties on the core group by using the WebSphere administrative
console. They must be specified for every core group that the application uses.
You must also specify the number of missed heartbeats with the
IBM_CS_FD_CONSECUTIVE_MISSED property. The value indicates how many heartbeats can be
missed before a peer JVM is considered failed. The hard failure detection time is
approximately the product of the heartbeat interval and the number of missed heartbeats.
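The product rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The function name is hypothetical, and the inputs (a 10-second interval and 20 missed heartbeats) are assumed values chosen so that the result matches the approximately 200 seconds cited earlier. They are not taken from Table 7-2.

```python
def hard_failure_detection_seconds(heartbeat_interval_s, missed_heartbeats):
    """Approximate hard-failure detection time.

    heartbeat_interval_s: core group heartbeat interval, in seconds.
    missed_heartbeats:    value of IBM_CS_FD_CONSECUTIVE_MISSED.
    """
    if heartbeat_interval_s <= 0 or missed_heartbeats <= 0:
        raise ValueError("interval and missed-heartbeat count must be positive")
    # Detection time is roughly interval x number of heartbeats that may be missed.
    return heartbeat_interval_s * missed_heartbeats


if __name__ == "__main__":
    # Assumed example inputs, not values read from a real configuration.
    print(hard_failure_detection_seconds(10, 20))
```

Lowering either input shortens the detection time, but also raises the risk of false failure detection on an unstable network.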
Heartbeat interval timing: An aggressive heartbeat interval can be useful when the
processes and network are stable. If the network or processes are not optimally
configured, heartbeats might be missed, which can result in a false failure detection.
Value   Failover detection   Description
0       Typical (default)    Failovers are typically detected within 30 seconds.
-1      Aggressive           Failovers are typically detected within 5 seconds.
1       Relaxed              Failovers are typically detected within 180 seconds.
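As a rough aid for choosing among the three -heartbeat values listed above, the following Python sketch picks the most relaxed setting whose typical detection time still fits a target. The helper names and the server name are hypothetical. The generated argument list shows only the -heartbeat parameter and omits any other options that a real startOgServer.sh invocation needs.

```python
# Typical detection times in seconds for each -heartbeat value (from Table 7-1).
HEARTBEAT_DETECTION_SECONDS = {-1: 5, 0: 30, 1: 180}


def choose_heartbeat(max_detection_s):
    """Return the most relaxed -heartbeat value that typically detects
    a failover within max_detection_s seconds."""
    candidates = [value for value, secs in HEARTBEAT_DETECTION_SECONDS.items()
                  if secs <= max_detection_s]
    if not candidates:
        raise ValueError("no -heartbeat setting typically detects failover that quickly")
    # Prefer the longest interval that still meets the target, because
    # aggressive settings can cause false failure detection.
    return max(candidates, key=HEARTBEAT_DETECTION_SECONDS.get)


def heartbeat_args(server_name, max_detection_s):
    """Build a partial argument list for startOgServer.sh (other options omitted)."""
    return ["startOgServer.sh", server_name, "-heartbeat",
            str(choose_heartbeat(max_detection_s))]


if __name__ == "__main__":
    # "server1" is a placeholder server name.
    print(" ".join(heartbeat_args("server1", 60)))
```

Choosing the most relaxed value that meets the target reflects the caution that aggressive intervals suit only stable processes and networks.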