58 WebSphere eXtreme Scale Best Practices for Operation and Management
4.1.6 Topology considerations for failure handling
Now, you must account for failures. This part is always a judgment call, but there are three
failure scenarios that are commonly considered:
► The topology must survive a single JVM failure and be able to reconstruct the full set of
primary and replica shards (maxSyncReplicas or maxAsyncReplicas). The simplest and
best way to allow for this reconstruction is to increase C by 1. For this reason, we did not
previously round C up so that it was evenly divisible by N, your number of servers. After
increasing C by 1, if the new C is not evenly divisible by N, you can further increase C until
it is. Extra memory never hurt any grid.
► The topology must survive a single server failure and be able to reconstruct the full set of
primary and replica shards (maxSyncReplicas or maxAsyncReplicas). This rule allows
any one server to be down for an extended period of time, such as for maintenance or
diagnostics. To ensure this capability, add one more server with the same amount of
memory as the others (if servers vary in memory size, add one more of the larger servers).
If you want to be less conservative, you can decide that your requirement is only that the
grid remain operational. If your topology to this point has allowed for one or more replicas,
one server going down results in the (we hope extremely brief) loss of a number of
replicas (and a warning message), but application operation continues because only
primaries are used in operation. Get that server running again immediately so that a
subsequent JVM failure does not trigger the loss of actual primary data.
► The topology must survive a single server failure followed by a single JVM failure. If you
have shut down a server for maintenance of the OS or another reason, a single JVM
failing must not disable your grid. You can determine whether you can survive this
situation with the following conditions:
– Normal state is C containers over N servers, or C/N containers per server.
– One server down means that you have C - C/N containers remaining.
– A subsequent JVM down means that you have C - C/N - 1 containers remaining.
– If and only if the following condition is true will you survive a double failure:
(C - C/N - 1) * Available heap @ 60% use > PrimaryDataSize
And now, you know why we included the “Available heap @ 60% use” column in Table
4-1 on page 54.
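The capacity rules above reduce to simple arithmetic. The following sketch expresses them in Java; the class and method names are illustrative, not part of eXtreme Scale, and the figures in main are an assumed example, not data from Table 4-1:

```java
// Sketch (not from the book): the two capacity rules as arithmetic.
// C = number of containers, N = number of servers; heap figures in bytes.
public class GridCapacityCheck {

    // Survive one JVM failure: add one container, then optionally round C
    // up to the next multiple of N so containers spread evenly per server.
    public static int containersWithJvmHeadroom(int c, int n) {
        int withSpare = c + 1;
        int remainder = withSpare % n;
        return remainder == 0 ? withSpare : withSpare + (n - remainder);
    }

    // Survive one server failure followed by one JVM failure:
    // (C - C/N - 1) * availableHeapAt60PctUse > PrimaryDataSize
    public static boolean survivesDoubleFailure(int c, int n,
            long availableHeapAt60PctBytes, long primaryDataSizeBytes) {
        int remaining = c - (c / n) - 1;
        return (long) remaining * availableHeapAt60PctBytes > primaryDataSizeBytes;
    }

    public static void main(String[] args) {
        // Assumed example: 12 containers on 4 servers, roughly 1.2 GB of
        // usable heap per container at 60% use, and 9 GB of primary data.
        System.out.println(containersWithJvmHeadroom(10, 4));   // prints 12
        System.out.println(survivesDoubleFailure(12, 4,
                1_200L * 1024 * 1024, 9_000L * 1024 * 1024));   // prints true
    }
}
```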
4.1.7 Determining the number of partitions
The value that you choose for numberOfPartitions in the deployment XML is only significant
if you choose a number that is too small or far too large. There are various subtle effects from
choosing a large or a small number, but the primary consideration results from the fact that a
given partition must live wholly in a single container (just as a given object, or entity tree if
using the WebSphere eXtreme Scale entity model, must live wholly in a single partition).
eXtreme Scale is extremely elastic in that you can grow or shrink the number of containers
(C) while the existing container servers (and the catalog servers) are running and in use.
Each new container simply uses the same XML file pair as the previous containers did. The
only limit on the number of containers you have running is the number of partitions; if
numberOfPartitions is P, you cannot have more than P containers started to host data.
Additional containers will start, but they will not receive any partition to host.
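For reference, numberOfPartitions is an attribute of the mapSet element in the deployment policy XML. A minimal sketch follows; the names Grid, mapSet, and Map1 are placeholders, and the replica settings are an assumed example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <objectgridDeployment objectgridName="Grid">
    <!-- With P = 127, at most 127 containers can host data -->
    <mapSet name="mapSet" numberOfPartitions="127"
            minSyncReplicas="0" maxSyncReplicas="1">
      <map ref="Map1"/>
    </mapSet>
  </objectgridDeployment>
</deploymentPolicy>
```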
For certain applications, C (the number of containers calculated previously) is the number
that you expect to have started at all times. For other grids, you know that C is fine most of the time
but you periodically have spikes of load (quarterly, end of month, and so on) where you will