Teardown failed, now what?
Check the logs of the primary catalog server to ensure that the catalog server is in a good
state with no current exceptions. If the catalog server appears to be in an unhealthy state, yet
the replica catalog servers appear healthy, stop or kill the process of the primary catalog
server and invoke the teardown again.
If the teardown was invoked using filtering options and the filtering appears to have failed,
attempt to invoke the teardown by passing in an explicit list of container servers to tear down.
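The two retry paths might look like the following sketch. The catalog server host, JMX
port, zone name, and container server names are placeholders for illustration; substitute
the values from your own topology.

   # Filtered teardown (by zone) that appeared to fail:
   xsadmin -ch cathost.example.com -p 1099 -teardown -fz ZoneA

   # Retry by passing an explicit list of container servers instead:
   xsadmin -ch cathost.example.com -p 1099 -teardown container01,container02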
If the teardown status for a server has failed, the best course of action is to take a series of
thread dumps from the process and then force termination (kill). Save all thread dumps and
logs from the process for evaluation. As long as the server is not listed in the output from the
xsadmin -containers command, it is not considered available from the perspective of the
catalog server’s placement plans and routing. If the server still appears in the
xsadmin -routetable or xsadmin -containers output, run the teardown command again.
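On Linux or UNIX systems, one way to capture the dumps and confirm the removal is
sketched below; the process ID and catalog server connection details are placeholders. On
an IBM JDK, kill -3 (SIGQUIT) writes a javacore thread dump without ending the process.

   # Take several thread dumps a few seconds apart (12345 is a placeholder PID):
   kill -3 12345; sleep 10
   kill -3 12345; sleep 10
   kill -3 12345

   # Save the javacore files and server logs, then force termination:
   kill -9 12345

   # Verify that the server is gone from placement and routing:
   xsadmin -ch cathost.example.com -p 1099 -containers
   xsadmin -ch cathost.example.com -p 1099 -routetable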
If all else fails, you can kill any processes that are in an unhealthy state; the ability to restart a
catalog or container will not be harmed by killing its process. If the processes that you kill
include all catalog servers or so many containers that primary data is lost, the grid is
effectively down and all remaining processes must be stopped or killed. The grid can then be
restarted.
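A full restart brings the catalog service up first and then the containers. The following
is a minimal sketch using the product's startOgServer script; the server names, host
names, ports, and XML file names are placeholders, and the end-point syntax assumes a
single catalog server.

   # Start the catalog server first (serverName:host:clientPort:peerPort):
   startOgServer.sh catalogServer01 -catalogServiceEndPoints catalogServer01:cathost.example.com:6601:6602

   # Then restart each container server, pointing it at the catalog service:
   startOgServer.sh container01 -objectgridFile objectgrid.xml -deploymentPolicyFile deployment.xml -catalogServiceEndPoints cathost.example.com:2809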
7.1.8 What to do when a JVM is lost
You manage high availability in WebSphere eXtreme Scale through the use of core groups
and the high availability manager. The catalog service is responsible for placing container
servers in core groups. A single member of the core group is designated as the leader of the
core group. The core group leader contacts the JVMs in the group and reports failures and
membership changes (a failed or new JVM) to the catalog service. JVM failures are detected
by the core group leader through missed heartbeats or from recognizing that a JVM’s socket
has closed. The catalog service itself operates as a private core group. It contains logic to
detect catalog server failures and includes quorum logic.
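To see how the catalog service has grouped your containers, you can dump the core
groups; a small sketch, assuming the xsadmin -coregroups diagnostic option and
placeholder connection details:

   # Display the core groups and their member container servers:
   xsadmin -ch cathost.example.com -p 1099 -coregroups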
You need to be aware of the following failure scenarios:
Container server failure
If the catalog service marks a container JVM as failed and the container is later reported
as alive, the container JVM is told to shut down its container servers. A JVM in this state
is not visible in xsadmin queries. These JVMs must be restarted manually (for example,
with the startOgServer script, as sketched earlier).
Quorum loss from catalog failure
All members of the catalog service are expected to be online. The catalog service
responds to container events only while it has quorum. Container servers will
retry requests rejected by the catalog. If you attempt to start a container during a quorum
loss event, the container will not start.
If quorum is lost due to a catalog server JVM failure, manually override the quorum as
quickly as possible (see the sketch at the end of this section).
If quorum is lost due to a temporary network failure, no action needs to be taken, unless
the outage will last for more than a few minutes. In that case, manually override the
quorum.
Note that client connectivity is allowed during quorum loss. If no container failures or
connectivity issues occur during the quorum loss event, clients can still fully interact with
the grid.
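When a manual override is required, point xsadmin at a surviving catalog server, as in the
following sketch; the host name and port are placeholders, and the -quorumstatus query is
an optional first step.

   # Check the current quorum status (diagnostic):
   xsadmin -ch survivingcathost.example.com -p 1099 -quorumstatus

   # Force the surviving catalog servers to re-establish quorum:
   xsadmin -ch survivingcathost.example.com -p 1099 -overridequorum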