Teardown failed, now what?
Check the logs of the primary catalog server to ensure that the catalog server is in a good
state with no current exceptions. If the catalog server appears to be in an unhealthy state, yet
the replica catalog servers appear healthy, stop or kill the process of the primary catalog
server and invoke the teardown again.
If the teardown was invoked using filtering options and the filtering appears to have failed,
attempt to invoke the teardown by passing in an explicit list of container servers to tear down.
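For example, on a stand-alone installation, an explicit teardown might look like the
following sketch (the host name, port, and container server names are placeholders for your
own topology):

   xsadmin -ch cathost.example.com -p 1099 -teardown container01,container02,container03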
If the teardown status for a server has failed, the best course of action is to take a series of
thread dumps from the process and then force termination (kill). Save all thread dumps and
logs from the process for evaluation. As long as the server is not listed in the output from the
xsadmin -containers command, it is not considered available from the perspective of the
catalog server’s placement plans and routing. If the server still appears in the
xsadmin -routetable or xsadmin -containers output, issue the teardown command again.
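On Linux and UNIX systems, this sequence might look like the following sketch, where <pid>
is the process ID of the stuck server; on IBM JDKs, kill -3 writes a javacore thread dump
without terminating the process:

   kill -3 <pid>    # first thread dump
   sleep 30
   kill -3 <pid>    # second thread dump, taken after a pause for comparison
   kill -9 <pid>    # force termination once the dumps and logs are saved
   xsadmin -ch cathost.example.com -p 1099 -containers   # server must no longer be listed
   xsadmin -ch cathost.example.com -p 1099 -routetable   # and must not appear in any route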
If all else fails, you can kill any processes that are in an unhealthy state; the ability to restart a
catalog or container will not be harmed by killing its process. If the processes that you kill
include all catalog servers or so many containers that primary data is lost, the grid is
effectively down and all remaining processes must be stopped or killed. The grid can then be
restarted.
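For a stand-alone installation, the restart might look like the following sketch, using the
startOgServer script with placeholder server names, configuration files, and ports; start the
catalog service before the containers:

   startOgServer.sh catalogServer1
   startOgServer.sh container1 -objectgridFile objectgrid.xml \
      -deploymentPolicyFile deployment.xml \
      -catalogServiceEndPoints cathost.example.com:2809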
7.1.8 What to do when a JVM is lost
You manage high availability in WebSphere eXtreme Scale through the use of core groups
and the high availability manager. The catalog service is responsible for placing container
servers in core groups. A single member of the core group is designated as the leader of the
core group. The core group leader contacts the JVMs in the group and reports failures and
membership changes (a failed or new JVM) to the catalog service. JVM failures are detected
by the core group leader through missed heartbeats or from recognizing that a JVM’s socket
has closed. The catalog service itself operates as a private core group. It contains logic to
detect catalog server failures and includes quorum logic.
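When you suspect a high availability problem, you can ask the catalog service for its
current view of core group membership. The following sketch assumes a stand-alone catalog
server listening on the default JMX service port:

   xsadmin -ch cathost.example.com -p 1099 -coregroups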
You need to be aware of the following failure scenarios:
• Container server failure
If the catalog service marks a container JVM as failed and the container is later reported
as alive, the container JVM will be told to shut down the container servers. A JVM in this
state will not be visible in xsadmin queries. These JVMs need to be manually restarted.
• Quorum loss from catalog failure
All members of the catalog service are expected to be online. The catalog service will only
respond to container events while the catalog service has quorum. Container servers will
retry requests rejected by the catalog. If you attempt to start a container during a quorum
loss event, the container will not start.
If quorum is lost due to a catalog server JVM failure, manually override the quorum as
quickly as possible (a sample override command follows this list).
If quorum is lost due to a temporary network failure, no action needs to be taken, unless
the outage will last for more than a few minutes. In that case, manually override the
quorum.
Note that client connectivity is allowed during quorum loss. If no container failures or
connectivity issues happen during the quorum loss event, clients can still fully interact with
the container servers.
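For reference, the quorum override mentioned in the list above is issued with the xsadmin
utility against a surviving catalog server. A sketch with a placeholder host name and the
default JMX service port:

   xsadmin -ch survivingcat.example.com -p 1099 -overridequorum

Only override quorum when you are certain that the missing catalog servers are truly down,
not merely unreachable behind a network partition; overriding quorum on both sides of a
partition can lead to a split-brain grid.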
