Chapter 7. Operations and monitoring 163
Figure 7-3 Components that have an impact on the speed at which failures are detected
You also need to tune the operating system for efficient operation and failure
detection. The following information center page, which focuses on network I/O
tuning, is a good starting point:
http://publib.boulder.ibm.com/infocenter/wxsinfo/v7r1/index.jsp?topic=/com.ibm.websphere.extremescale.admin.doc/cxsopernetw.html
7.4.1 Container failover detection
This section describes settings that can be used to configure the failover detection on the
containers.
Setting container ORB time-outs
The settings in Example 7-21 show how to set the failover detection for the ORB to
5 seconds. ORB properties can be set with an orb.properties file, as application server
settings in the administrative console, or as custom properties on the ORB in the
administrative console.
[Figure 7-3 diagram, recoverable labels only. Components: WebSphere eXtreme Scale client application, WebSphere Application Server, ORB, HAManager core group, WebSphere eXtreme Scale grid container, ObjectGrid, backingMap. Settings: ORB client timeouts, ORB server timeouts, txTimeout, lockTimeout, requestRetryTimeout, heartbeat settings.]
Example 7-21 orb.properties configuration file
com.ibm.CORBA.RequestTimeout=5
com.ibm.CORBA.ConnectTimeout=5
com.ibm.CORBA.FragmentTimeout=5
com.ibm.CORBA.LocateRequestTimeout=5
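The four ORB properties in Example 7-21 are easy to get out of sync across servers, so a small check can help. The following is a minimal sketch, not part of the product: the helper names (parse_properties, check_orb_timeouts) are hypothetical, and the parser handles only plain key=value lines rather than the full Java properties syntax.

```python
# Sketch: verify that the four ORB time-out properties from Example 7-21
# are present and set to the expected number of seconds.
# All function names here are hypothetical helpers, not a product API.

ORB_TIMEOUT_KEYS = (
    "com.ibm.CORBA.RequestTimeout",
    "com.ibm.CORBA.ConnectTimeout",
    "com.ibm.CORBA.FragmentTimeout",
    "com.ibm.CORBA.LocateRequestTimeout",
)


def parse_properties(text):
    """Parse simple key=value lines; skip blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_orb_timeouts(props, expected_seconds=5):
    """Return {key: problem} for keys that are missing or differ from expected."""
    problems = {}
    for key in ORB_TIMEOUT_KEYS:
        value = props.get(key)
        if value is None:
            problems[key] = "missing"
        elif value != str(expected_seconds):
            problems[key] = f"is {value}, expected {expected_seconds}"
    return problems


EXAMPLE_7_21 = """\
com.ibm.CORBA.RequestTimeout=5
com.ibm.CORBA.ConnectTimeout=5
com.ibm.CORBA.FragmentTimeout=5
com.ibm.CORBA.LocateRequestTimeout=5
"""

if __name__ == "__main__":
    # An empty result means all four time-outs match the expected value.
    print(check_orb_timeouts(parse_properties(EXAMPLE_7_21)))
```

To check a real file, read its contents and pass the text to parse_properties in the same way.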
Setting core group heartbeat intervals
WebSphere eXtreme Scale relies on a heartbeat mechanism to detect whether containers
are up and running. The actual implementation is based on the core groups that are used by
WebSphere’s High Availability Manager.
Failures because of process crashes (or soft failures) are typically detected in less
than one second. The network sockets are automatically closed by the operating system
hosting the process when a soft failure occurs. A hard failure is a physical computer or
server crash, network cable disconnection, or operating system error. In this case, it can
take much longer for WebSphere eXtreme Scale to detect that a container has failed. To
ensure that hard failures are detected within a reasonable amount of time, configure the
heartbeat mechanism.
Configuring failover for stand-alone environments
You can configure heartbeat intervals on the command line by using the -heartbeat
parameter of the startOgServer.sh command-line script. Table 7-1 shows the available
options.
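The three -heartbeat values listed in Table 7-1 can be captured in a small lookup when start commands are generated by a script. This is a sketch under stated assumptions: the function names are hypothetical, the detection times are the "typical" figures from the table rather than guarantees, and only the -heartbeat flag itself comes from the text; all other startOgServer.sh arguments are deliberately left out.

```python
# Sketch: the -heartbeat presets from Table 7-1 as a lookup table.
# Keys are the values passed to -heartbeat; each entry holds the table's
# label and the typical failover detection time in seconds.
HEARTBEAT_PRESETS = {
    0: ("Typical (default)", 30),
    -1: ("Aggressive", 5),
    1: ("Relaxed", 180),
}


def heartbeat_args(value):
    """Return the command-line fragment for a valid -heartbeat value."""
    if value not in HEARTBEAT_PRESETS:
        raise ValueError(f"unsupported -heartbeat value: {value!r}")
    return ["-heartbeat", str(value)]


def pick_preset(max_detection_seconds):
    """Pick the most relaxed preset whose typical detection time fits the bound."""
    candidates = [
        (seconds, value)
        for value, (_label, seconds) in HEARTBEAT_PRESETS.items()
        if seconds <= max_detection_seconds
    ]
    if not candidates:
        raise ValueError("no preset detects failures that quickly")
    # max() compares the detection time first, so the slowest fitting preset wins.
    return max(candidates)[1]


if __name__ == "__main__":
    value = pick_preset(60)
    print(HEARTBEAT_PRESETS[value][0], heartbeat_args(value))
```

Choosing the most relaxed preset that still meets the target keeps heartbeat traffic low and reduces the risk of false failure detection on an unstable network.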
Table 7-1 Heartbeat interval options for stand-alone environments
Value   Failover detection   Description
0       Typical (default)    Failovers are typically detected within 30 seconds.
-1      Aggressive           Failovers are typically detected within 5 seconds.
1       Relaxed              Failovers are typically detected within 180 seconds.
Configuring failover for WebSphere Application Server V6 environments
You can configure WebSphere Application Server Network Deployment 6.0.2 and 6.1 to
control the WebSphere eXtreme Scale failover behavior. The default failover time for hard
failures is approximately 200 seconds, which is relatively long.
WebSphere eXtreme Scale running in a WebSphere Application Server process inherits the
failover characteristics from the core group settings of the application server. Table 7-2 on
page 165 lists the custom properties that are used to configure the core group heartbeat
settings for various versions of WebSphere Application Server Network Deployment V6 and
V6.1. Set these custom properties on the core group in the WebSphere administrative
console. They must be specified for all core groups that are used by the application.
You must also specify the number of missed heartbeats with the
IBM_CS_FD_CONSECUTIVE_MISSED property. The value indicates how many heartbeats can be
missed before a peer JVM is considered failed. The hard failure detection time is
approximately the product of the heartbeat interval and the number of missed heartbeats.
Heartbeat interval timing: An aggressive heartbeat interval can be useful when the
processes and network are stable. If the network or processes are not optimally
configured, heartbeats might be missed, which can result in a false failure detection.
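The rule that hard failure detection time is approximately the heartbeat interval multiplied by the number of missed heartbeats can be sketched as a short calculation. The function names and the sample numbers below are illustrative assumptions, not documented defaults; the heartbeat interval stands for whichever core group interval property applies to your WebSphere Application Server version.

```python
import math


def hard_failure_detection_seconds(heartbeat_interval_seconds, consecutive_missed):
    """Approximate detection time: heartbeat interval x missed heartbeats.

    consecutive_missed corresponds to IBM_CS_FD_CONSECUTIVE_MISSED.
    """
    if heartbeat_interval_seconds <= 0:
        raise ValueError("heartbeat interval must be positive")
    if consecutive_missed < 1:
        raise ValueError("at least one missed heartbeat is required")
    return heartbeat_interval_seconds * consecutive_missed


def max_missed_for_target(target_seconds, heartbeat_interval_seconds):
    """Largest missed-heartbeat count that keeps detection within the target."""
    if heartbeat_interval_seconds <= 0:
        raise ValueError("heartbeat interval must be positive")
    missed = math.floor(target_seconds / heartbeat_interval_seconds)
    if missed < 1:
        raise ValueError("target is shorter than one heartbeat interval")
    return missed


if __name__ == "__main__":
    # Illustrative values only: a 2-second interval with 5 missed heartbeats.
    print(hard_failure_detection_seconds(2, 5))
    print(max_missed_for_target(10, 2))
```

A shorter interval with a higher missed-heartbeat count reaches the same detection time as a longer interval with a lower count, but tolerates an occasional dropped heartbeat better, which matters given the risk of false failure detection noted above.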
