WebSphere eXtreme Scale Best Practices for Operation and Management

Chapter 7. Operations and monitoring

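Figure 7-3 Components that have an impact on the speed at which failures are detected (figure labels: ORB client timeouts, requestRetryTimeout, ORB server timeouts, txTimeout, lockTimeout, heartbeat settings)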

You also need to tune the operating system for efficient operation and failure detection. The tuning focuses on network I/O, and the following page in the information center provides a good starting point:

http://publib.boulder.ibm.com/infocenter/wxsinfo/v7r1/index.jsp?topic=/com.ibm.websphere.extremescale.admin.doc/cxsopernetw.html

7.4.1 Container failover detection

This section describes the settings that you can use to configure failover detection on the containers.

Setting container ORB time-outs

The settings in Example 7-21 show how to set the failover detection for the ORB to 5 seconds. You can set ORB properties in an orb.properties file, as application server settings in the administrative console, or as custom properties on the ORB in the administrative console.

Example 7-21 orb.properties configuration file

com.ibm.CORBA.RequestTimeout=5

com.ibm.CORBA.ConnectTimeout=5

com.ibm.CORBA.FragmentTimeout=5

com.ibm.CORBA.LocateRequestTimeout=5
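As a quick sanity check on a file like Example 7-21, the following minimal sketch parses orb.properties-style text and reports any of the four timeout keys that are missing or set above a chosen target. The helper names (parse_properties, orb_timeouts, missing_or_slow) are hypothetical and are not part of any WebSphere tooling. The parser handles only plain key=value lines, and values are compared as plain integers.

```python
# Sketch only: helper names are illustrative, not a WebSphere API.
# The four keys are the ones shown in Example 7-21.
ORB_TIMEOUT_KEYS = (
    "com.ibm.CORBA.RequestTimeout",
    "com.ibm.CORBA.ConnectTimeout",
    "com.ibm.CORBA.FragmentTimeout",
    "com.ibm.CORBA.LocateRequestTimeout",
)

EXAMPLE_7_21 = """\
com.ibm.CORBA.RequestTimeout=5
com.ibm.CORBA.ConnectTimeout=5
com.ibm.CORBA.FragmentTimeout=5
com.ibm.CORBA.LocateRequestTimeout=5
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def orb_timeouts(text):
    """Return {key: int value, or None if unset} for the four timeout keys."""
    props = parse_properties(text)
    return {k: int(props[k]) if k in props else None for k in ORB_TIMEOUT_KEYS}


def missing_or_slow(text, limit):
    """List the timeout keys that are unset or larger than the target value."""
    return [k for k, v in orb_timeouts(text).items() if v is None or v > limit]


print(missing_or_slow(EXAMPLE_7_21, limit=5))  # prints []
```

An empty list means that every timeout in the file is present and no larger than the target. A file that sets only some of the keys returns the names of the ones that still need attention.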

Setting core group heartbeat intervals

WebSphere eXtreme Scale relies on a heartbeat mechanism to detect whether containers are up and running. The implementation is based on the core groups that are used by WebSphere’s High Availability Manager.

Failures caused by process crashes (soft failures) are typically detected in less than one second, because the operating system hosting the process automatically closes its network sockets when a soft failure occurs. A hard failure is a physical computer or server crash, a network cable disconnection, or an operating system error. In this case, it can take much longer for WebSphere eXtreme Scale to detect that a container has failed. To ensure that hard failures are detected within a reasonable amount of time, configure the heartbeat mechanism.

Configuring failover for stand-alone environments

You can configure heartbeat intervals on the command line by using the -heartbeat parameter of the startOgServer.sh command-line script. Table 7-1 shows the available options.

Table 7-1 Heartbeat interval options for stand-alone environments
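The three -heartbeat values can be sketched as a small lookup that picks a level for a required detection time and builds the matching arguments. The values and typical detection times come from Table 7-1. The function names, the server name server0, and the script path are assumptions made for illustration, and the other options that a real startOgServer.sh invocation needs are deliberately left out.

```python
# Sketch only: the mapping mirrors Table 7-1; everything else is illustrative.
# level -> (-heartbeat value, typical detection time in seconds)
HEARTBEAT_LEVELS = {
    "aggressive": (-1, 5),
    "typical": (0, 30),
    "relaxed": (1, 180),
}


def level_within(max_seconds):
    """Pick the most relaxed level whose typical detection time fits the bound."""
    fitting = [(secs, name) for name, (_, secs) in HEARTBEAT_LEVELS.items()
               if secs <= max_seconds]
    if not fitting:
        raise ValueError("no heartbeat level detects failures that quickly")
    return max(fitting)[1]


def start_command(server_name, level="typical", script="./startOgServer.sh"):
    """Build an argument list with the -heartbeat option for the given level.

    server_name and script are hypothetical placeholders; other options that
    the script requires in a real environment are omitted here.
    """
    value, _ = HEARTBEAT_LEVELS[level]
    return [script, server_name, "-heartbeat", str(value)]


print(" ".join(start_command("server0", level_within(10))))
# prints: ./startOgServer.sh server0 -heartbeat -1
```

Choosing the most relaxed level that still meets the bound avoids a more aggressive heartbeat than the requirement calls for, which matters on networks where heartbeats can occasionally be missed.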

Configuring failover for WebSphere Application Server V6 environments

You can configure WebSphere Application Server Network Deployment 6.0.2 and 6.1 to control the WebSphere eXtreme Scale failover behavior. The default failover time for hard failures is approximately 200 seconds, which is relatively long.

WebSphere eXtreme Scale running in a WebSphere Application Server process inherits its failover characteristics from the core group settings of the application server. Table 7-2 lists the custom properties that configure the core group heartbeat settings for various versions of WebSphere Application Server Network Deployment V6 and V6.1. Specify these as custom properties on the core group in the WebSphere administrative console, and specify them for every core group that the application uses.

You must also specify the number of missed heartbeats with the IBM_CS_FD_CONSECUTIVE_MISSED property. The value indicates how many heartbeats can be missed before a peer JVM is considered failed. The hard failure detection time is approximately the product of the heartbeat interval and the number of missed heartbeats.
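The product rule can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the 10-second interval and count of 3 are illustrative values chosen for the example, not product defaults.

```python
# Sketch only: function names and sample values are illustrative.

def hard_failure_detection_seconds(heartbeat_interval_s, consecutive_missed):
    """Approximate detection time: heartbeat interval times missed heartbeats."""
    if heartbeat_interval_s <= 0 or consecutive_missed <= 0:
        raise ValueError("interval and missed-heartbeat count must be positive")
    return heartbeat_interval_s * consecutive_missed


def missed_count_for(target_s, heartbeat_interval_s):
    """Largest missed-heartbeat count that keeps the product within target_s.

    Never returns less than 1, because at least one heartbeat must be missed
    before a failure can be declared.
    """
    if target_s <= 0 or heartbeat_interval_s <= 0:
        raise ValueError("target and interval must be positive")
    return max(1, int(target_s // heartbeat_interval_s))


# A 10-second interval with 3 consecutive misses gives about 30 seconds.
print(hard_failure_detection_seconds(10, 3))  # prints 30

# Working backward: a 45-second target with a 10-second interval allows 4 misses.
print(missed_count_for(45, 10))  # prints 4
```

The second function shows the usual way to apply the rule: fix the heartbeat interval first, then derive the value for IBM_CS_FD_CONSECUTIVE_MISSED from the detection time that the application can tolerate.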

Heartbeat interval timing: An aggressive heartbeat interval can be useful when the processes and network are stable. If the network or processes are not optimally configured, heartbeats might be missed, which can result in a false failure detection.

-heartbeat value   Failover detection   Description
0                  Typical (default)    Failovers are typically detected within 30 seconds.
-1                 Aggressive           Failovers are typically detected within 5 seconds.
1                  Relaxed              Failovers are typically detected within 180 seconds.


WebSphere eXtreme Scale Best Practices for Operation and Management by Ying Ding, Bertrand Fayn, Art Jolin, Hendrik Van Run, Carla Sadtler, Chunmo Son, Sukumar Subburaj, Tong Xie
