Chapter 4. Systems Management 229
recovery times. For example, an operator might fail to see a message or type a command
incorrectly. Also, an operator might have to type long sequences of commands, remembering
the command syntax of several programs or components (or take the time to look them up).
The opportunities for operator error are many.
• Substitute automatic responses for operator-typed commands. That will reduce the
opportunities for error. When operator intervention is required, automation procedures can
simplify the tasks, reduce the chances of mistakes, and ensure similar responses to
similar events. Automation also expedites shutdown, initialization, and recovery
procedures, reducing planned downtime.
• Automate recovery procedures. If an application experiences an error, attempt to recover
from the situation as your policy defines. Recovery can include issuing commands or
replies to a message, and restarting the application if it has ABENDed. You can also
specify selective conditions and thresholds under which the automation does not attempt
to recover an ABENDed application. Also, consider conditions that need to be satisfied
before an application is started or stopped. For example, a certain application might
need to be backed up before it can be started.
• Suppress messages. The first job of an automation project is to manage the messages
that are directed to Operations. Log analysis programs can help identify frequently
issued messages to target for suppression or automation.
• Monitor key batch job start and end times, network activity, and I/O errors to critical data
sets.
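The idea of thresholds under which automation stops attempting recovery can be sketched as follows. This is a minimal illustration, not the behavior of any particular automation product: the `RestartPolicy` class, its parameters, and the choice of which abend codes to exclude are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RestartPolicy:
    """Hypothetical policy: allow at most max_restarts within window_seconds."""
    max_restarts: int = 3
    window_seconds: float = 600.0
    # Abend codes that are never recovered automatically (illustrative choice).
    no_restart_codes: frozenset = frozenset()
    _restart_times: List[float] = field(default_factory=list)

    def should_restart(self, abend_code: str, now: float) -> bool:
        """Decide whether automation should restart the application."""
        if abend_code in self.no_restart_codes:
            return False
        # Forget restarts that fall outside the sliding time window.
        self._restart_times = [
            t for t in self._restart_times if now - t < self.window_seconds
        ]
        if len(self._restart_times) >= self.max_restarts:
            # Threshold reached: leave the application down for an operator.
            return False
        self._restart_times.append(now)
        return True


# Example: two restarts per minute; S222 (operator cancel) is never restarted.
policy = RestartPolicy(max_restarts=2, window_seconds=60.0,
                       no_restart_codes=frozenset({"S222"}))
decisions = [policy.should_restart("S0C4", t) for t in (0.0, 10.0, 20.0)]
```

Here the third abend inside the window is not restarted, which keeps automation from looping on an application that fails repeatedly and escalates the problem to an operator instead.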
4.4.5 Application testing
• New versions of z/OS and subsystems typically have new and changed messages. As the
new software versions are being tested, automation changes to handle the changed
messages should be tested at the same time. Typically, automation changes consist
of just suppressing the new informational messages. Decisions are needed on how to
handle the error and action messages; ideally, automation handles the action
messages itself. Try to display messages only when operator action is required to
address a threat to availability.
• New versions of applications also have new and changed messages. Make sure
automation is in place to suppress and control these messages as well. This requires
that all application messages have a message ID.
• As with all changes, test automation changes on test systems before promoting them to
production.
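One way to check message coverage during testing is to compare the message IDs seen on a test system against the IDs that automation already has rules for. The sketch below assumes that the message ID is the first whitespace-delimited token on each log line and that automation rules are available as a simple mapping. The function names, the rule format, and the sample lines (including the application message `APP0001E`) are hypothetical.

```python
from typing import Dict, Iterable, Set


def message_id(line: str) -> str:
    """Return the first token of a log line, assumed here to be the message ID."""
    parts = line.split()
    return parts[0] if parts else ""


def uncovered_message_ids(log_lines: Iterable[str],
                          rules: Dict[str, str]) -> Set[str]:
    """Return message IDs that appear in the log but have no automation rule."""
    seen = {message_id(line) for line in log_lines}
    seen.discard("")  # ignore blank lines
    return seen - set(rules)


# Illustrative rule table: message ID -> action.
rules = {"IEF403I": "suppress", "IEF404I": "suppress"}

# Illustrative log lines captured on a test system.
test_log = [
    "IEF403I PAYROLL - STARTED",
    "APP0001E DATABASE CONNECTION LOST",
    "",
    "IEF404I PAYROLL - ENDED",
]

missing = uncovered_message_ids(test_log, rules)
```

Any ID left in `missing` is a candidate for a suppression or automation decision before the change is promoted. This kind of check works only if every application message carries a message ID.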
4.4.6 Passive and active monitoring
Passive monitoring is the only type of monitoring found in many locations. It consists of simply
receiving information from system messages, alerts, and notifications. These can cause
updates to resource status displays, which prompt operator action for an event that has already
started. Ideally, automation would notify operators of a potential problem before it
starts. This is done by active monitoring.
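As a rough illustration of active monitoring, the sketch below samples a metric and raises a warning when it stays at or above a warning level for several consecutive samples, before it reaches a level that would cause an outage. The function name, the warning level, and the sample values are assumptions chosen for the example; real monitoring tools use their own policies and data sources.

```python
from typing import Iterable, Optional


def first_warning_index(samples: Iterable[float],
                        warn_level: float,
                        consecutive: int = 3) -> Optional[int]:
    """Return the index at which `consecutive` samples in a row have been
    at or above warn_level, or None if that never happens."""
    run = 0
    for i, value in enumerate(samples):
        # Extend the run on a high sample; reset it on a normal one.
        run = run + 1 if value >= warn_level else 0
        if run >= consecutive:
            return i
    return None


# Illustrative CPU utilization samples (percent), taken at regular intervals.
cpu_samples = [40, 55, 82, 85, 79, 81, 83, 90, 97]
alert_at = first_warning_index(cpu_samples, warn_level=80, consecutive=3)
```

Requiring several consecutive high samples is one simple way to avoid alerting on a brief spike while still notifying operators before the metric reaches a critical level.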
Active monitoring tools can proactively identify and respond to system events that typically
precede larger system problems, thereby preventing many types of common system failures,
ranging from performance degradation to brief downtime to catastrophic failure. To maximize
availability, tools can continually monitor system components, such as CPU, I/O rates,
transaction rates, network connectivity, and their interactions, identifying events that are likely
to occur and taking appropriate automatic actions to prevent a failure. Many problems can be
