Event Management and Best Practices by Michael Wallace, Guilherme Pereira, Jacqueline Meckwood, Peter Glasmacher, Tony Bhe


discarded. Other times, a problem does not need to be investigated until it occurs
several times. For example, a high CPU condition may not be a problem if a
single process, such as a backup, uses many cycles for a minute or two.
However, if the condition happens several times within a certain time interval,
there most likely is a problem. In this case, the problem should be addressed
after the necessary number of occurrences. Unless diagnostic data, such as the
raw CPU busy values, is required from subsequent events, they can be dropped.
The process of reporting events after a certain number of occurrences is known
as throttling.
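
As an illustration of throttling, the following is a minimal Python sketch, not
taken from any particular event management product. The class name Throttle,
the parameters threshold and window, and the event key are all hypothetical.
It reports a condition only when it has occurred a given number of times within
a time interval and drops the occurrences that stay below that threshold.

```python
from collections import defaultdict, deque


class Throttle:
    """Hypothetical throttle: report after `threshold` occurrences in `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        # Event key -> timestamps of the occurrences still inside the window.
        self._seen = defaultdict(deque)

    def observe(self, key, timestamp):
        """Record one occurrence; return True if the event should be reported."""
        times = self._seen[key]
        times.append(timestamp)
        # Forget occurrences that have fallen out of the time interval.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()  # assumption: start counting afresh after reporting
            return True
        return False  # below the threshold: the event is dropped


# Three high-CPU events within 300 seconds trigger one report;
# a later isolated event does not.
throttle = Throttle(threshold=3, window=300)
results = [throttle.observe(("hostA", "cpu_high"), t) for t in (0, 60, 120, 1000)]
```

In this sketch the counter is reset once an event is reported. An event
processor could equally keep reporting every later occurrence; the text does
not prescribe either behavior.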
1.3.4 Correlation
When multiple events are generated as a result of the same initial problem or
provide information about the same system resource, there may be a relationship
between the events. The process of defining this relationship in an event
processor and implementing actions to deal with the related events is known as
event correlation.
Correlated events may reference the same affected resource or different
resources. They may be generated by the same event source or handled by the
same event processor.
Problem and clearing event correlation
This section presents an example of events that are generated from the same
event source and deal with the same system resource. An agent monitoring a
system detects that a service has failed and sends an event to an event
processor. This event, which describes an error condition, is called a problem
event. When the service is later restored, the agent sends another event to
inform the event processor that the service is running again and the error
condition has cleared. This event is known as a clearing event. When an event
processor receives a clearing event, it normally closes the problem event to
show that it is no longer an issue.
The relationship between the problem and clearing event
can be depicted graphically as shown in Figure 1-1. The
correlation sequence is described as follows:
- Problem is reported when received (Service Down).
- Event is closed when a recovery event is received (Service Recovered).
Figure 1-1 Problem and clearing correlation sequence
(Diagram labels: Service Down (Problem Event); Service Recovered (Clearing Event))
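
The problem and clearing sequence can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the class
EventProcessor, the dictionary fields host, service, and status, and the
returned action strings are hypothetical and do not come from a specific
product.

```python
class EventProcessor:
    """Hypothetical processor that pairs problem events with clearing events."""

    def __init__(self):
        # (host, service) -> the problem event that is currently open.
        self.open_problems = {}

    def receive(self, event):
        """Handle one event and return the action taken as a string."""
        key = (event["host"], event["service"])
        if event["status"] == "down":
            # Problem event: report it and keep it open until it is cleared.
            if key not in self.open_problems:
                self.open_problems[key] = event
                return "reported"
            return "already open"
        if event["status"] == "recovered":
            # Clearing event: close the matching problem event, if there is one.
            if self.open_problems.pop(key, None) is not None:
                return "closed"
            return "no open problem"
        raise ValueError(f"unknown status: {event['status']!r}")


processor = EventProcessor()
first = processor.receive({"host": "hostA", "service": "httpd", "status": "down"})
second = processor.receive({"host": "hostA", "service": "httpd", "status": "recovered"})
```

Here the pair (host, service) serves as the correlation key, which is the
assumption that lets the clearing event find the problem event it closes.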
Chapter 1. Introduction to event management
Taking this example further, assume that multiple agents are on the system. One
reads the system log, extracts error messages, and sends them as events. The
second agent actively monitors system resources and generates events when it
detects error conditions. A service running on the system writes an error
message to the system log when it dies. The first agent reads the log, extracts
the error message, and sends it as an event to the event processor. The second
agent, configured to monitor the status of the service, detects that it has
stopped and sends an event as well. When the service is restored, a message is
written to the system log, which the first agent sends as an event, and the
second agent detects the recovery and sends its own event.
The event processor receives both problem events, but only needs to report the
service failure once. The events can be correlated and one of them dropped.
Likewise, only one of the clearing events is required. This correlation
sequence is shown in Figure 1-2 and follows this process:
- A problem event is reported if received from the log.
- The event is closed when the Service Recovered event is received from the log.
- If a Service Down event is received from a monitor, the Service Down event
  from the log takes precedence, and the Service Down event from a monitor
  becomes extraneous and is dropped.
- If a Service Down event is not received from the log, the Service Down event
  from a monitor is reported and closed when the Service Recovered event is
  received from the monitor.
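
The rules above can be sketched as follows. This is a simplified illustration,
not a product implementation: the class MultiSourceCorrelator, its arguments,
and the action strings are hypothetical. The text does not say what happens
when the monitor event arrives before the log event, so the sketch assumes that
a late log event silently takes over the open problem.

```python
class MultiSourceCorrelator:
    """Hypothetical correlator for the same problem reported by a log and a monitor."""

    def __init__(self):
        # (host, service) -> source ("log" or "monitor") that owns the open problem.
        self.open_problems = {}

    def receive(self, host, service, status, source):
        """Handle one event and return "reported", "closed", or "dropped"."""
        key = (host, service)
        owner = self.open_problems.get(key)
        if status == "down":
            if owner is None:
                # First Service Down for this resource: report it.
                self.open_problems[key] = source
                return "reported"
            if owner == "monitor" and source == "log":
                # Assumption: the log event takes precedence even when it
                # arrives late, so it becomes the owner without a second report.
                self.open_problems[key] = "log"
            # The failure is already reported, so this event is extraneous.
            return "dropped"
        if status == "recovered":
            if owner == source:
                # The clearing event from the owning source closes the problem.
                del self.open_problems[key]
                return "closed"
            return "dropped"  # clearing event from the other source is not needed
        raise ValueError(f"unknown status: {status!r}")


correlator = MultiSourceCorrelator()
actions = [
    correlator.receive("hostA", "httpd", "down", "log"),
    correlator.receive("hostA", "httpd", "down", "monitor"),
    correlator.receive("hostA", "httpd", "recovered", "monitor"),
    correlator.receive("hostA", "httpd", "recovered", "log"),
]
```

With both sources reporting, only the log events are acted on. When the log
event never arrives, the same code reports and closes the problem from the
monitor events alone.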
This scenario is different from duplicate detection. The events being correlated
both report service down, but they are from different event sources and most
likely have different formats. Duplicate detection implies that the events are of the
same format and are usually, though not always, from the same event source. If
the monitoring agent in this example detects a down service, and repeatedly
sends events reporting that the service is down, these events can be handled
with duplicate detection.
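
For contrast, duplicate detection can be sketched as a lookup on identical
event fields. The class DuplicateDetector and the fields that form the key are
hypothetical. Including the event source in the key reflects the usual case
described above, not a requirement.

```python
class DuplicateDetector:
    """Hypothetical detector that keeps the first event and counts its repeats."""

    def __init__(self):
        # Identifying fields of an event -> number of times it has been seen.
        self.counts = {}

    def receive(self, event):
        """Return True for the first occurrence and False for a duplicate."""
        key = (event["source"], event["host"], event["service"], event["status"])
        if key in self.counts:
            self.counts[key] += 1  # duplicate: only the repeat count is updated
            return False
        self.counts[key] = 1
        return True


detector = DuplicateDetector()
down = {"source": "monitor", "host": "hostA", "service": "httpd", "status": "down"}
kept = [detector.receive(down) for _ in range(3)]
```

Because the key includes the source, a Service Down event from the log would
not match the monitor's events here; relating those two is the job of
correlation.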
Figure 1-2 Correlation of multiple events reporting the same problem
(Diagram labels: Service Down (Problem Event from Log); Service Down (Problem
Event from Monitor); Service Recovered (Clearing Event from Log); Service
Recovered (Clearing Event from Monitor))
