50 Event Management and Best Practices
2.4.3 Duplicate detection and throttling best practices
We make several recommendations for duplicate detection and throttling, and
provide the rationale behind each:
Perform duplicate detection as close to the source as possible.
This practice saves cycles and system resources on event processors, and
minimizes bandwidth used for sending unnecessary events.
When possible, configure the event source to detect duplicates and suppress
them. If the source cannot do so, or if doing so there requires unduly
cumbersome tool configuration, use the closest event processor capable of
performing the function.
Use throttling for intermittent problems that may clear themselves
automatically and do not always require action. After a problem is reported,
suppress or drop duplicates.
It is frustrating for a support person to investigate a problem only to find that
the problem has disappeared or requires no action. If this occurs too often,
the support person loses faith in the systems management tools and begins
to ignore their notifications.
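The throttling recommendation above can be sketched as follows. This is a minimal illustration, not the behavior of any particular event management product: the class name Throttle, the threshold of three occurrences, and the 300-second window are all assumptions chosen for the example. A problem is reported only once it has recurred often enough within the window, and later duplicates are dropped until the problem is cleared.

```python
from collections import defaultdict, deque


class Throttle:
    """Hypothetical throttle: report after `threshold` occurrences within `window` seconds."""

    def __init__(self, threshold=3, window=300.0):
        self.threshold = threshold              # assumed value, tune per event type
        self.window = window                    # assumed value, in seconds
        self._recent = defaultdict(deque)       # event key -> timestamps inside the window
        self._reported = set()                  # keys already reported and still open

    def offer(self, key, timestamp):
        """Return True if this occurrence should be forwarded as a notification."""
        if key in self._reported:
            return False                        # already reported: drop the duplicate
        times = self._recent[key]
        times.append(timestamp)
        # Discard occurrences that have aged out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            self._reported.add(key)
            times.clear()
            return True
        return False

    def clear(self, key):
        """Call when the problem is resolved so that a recurrence is reported again."""
        self._reported.discard(key)
        self._recent.pop(key, None)


throttle = Throttle(threshold=3, window=300.0)
key = ("host01", "link_flap")
burst = [throttle.offer(key, t) for t in (0, 60, 120, 180)]

sparse_throttle = Throttle(threshold=3, window=300.0)
sparse = [sparse_throttle.offer(key, t) for t in (0, 400, 800)]
```

In the burst case only the third occurrence is forwarded, and the fourth is dropped as a duplicate. In the sparse case the occurrences are too far apart to reach the threshold, so the intermittent problem never reaches the support person.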
For events that indicate problems always requiring action, inform when the
first event is received, and suppress or drop duplicates.
This notifies the support person as quickly as possible and minimizes the
mean time to repair for the problem.
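For problems that always require action, the rule is simpler: forward the first event and suppress the rest. The sketch below assumes that events can be reduced to a key (here a hypothetical "host:problem" string) that identifies the underlying problem. The repeat counter is an optional addition for illustration, not something this section prescribes.

```python
class DuplicateSuppressor:
    """Hypothetical filter: notify on the first event, suppress later duplicates."""

    def __init__(self):
        self._open = {}     # event key -> occurrences seen, including the first

    def offer(self, key):
        """Return True only for the first occurrence of an open problem."""
        if key in self._open:
            self._open[key] += 1    # record the repeat, but do not notify again
            return False
        self._open[key] = 1
        return True

    def count(self, key):
        """Number of occurrences seen for an open problem (0 if not open)."""
        return self._open.get(key, 0)

    def close(self, key):
        """Call when the problem is resolved so that a recurrence notifies again."""
        self._open.pop(key, None)


suppressor = DuplicateSuppressor()
decisions = [suppressor.offer("db01:disk_full") for _ in range(4)]
repeats = suppressor.count("db01:disk_full")
```

Placing a filter like this at the event source, or at the nearest event processor that can run it, keeps the duplicates from consuming bandwidth and processing cycles further upstream.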
Do not use duplicate events as reminders that the original problem has not
yet been handled. Instead, use escalation.
Using duplicate events as reminders that a problem is still open is a bad
practice. The extra events clutter consoles, possibly forcing the operator to sift
through many events to find the meaningful ones. If there are too many
events, the console user may begin to ignore them or close them en masse.
See “Unhandled events” on page 60 for a discussion about using escalation
for events that have not been handled in a timely manner.
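As a rough illustration of escalation replacing duplicate reminders, the sketch below escalates an event once if it remains unacknowledged past a timeout. The OpenEvent record, its field names, and the 900-second timeout are assumptions made for the example, and a real escalation policy may have several levels.

```python
from dataclasses import dataclass


@dataclass
class OpenEvent:
    """Hypothetical record for one open problem on the console."""
    key: str
    opened_at: float            # seconds, on the same clock as `now` below
    acknowledged: bool = False
    escalated: bool = False


def escalate_unhandled(events, now, timeout=900.0):
    """Mark and return events left unacknowledged for longer than `timeout` seconds."""
    newly_escalated = []
    for event in events:
        overdue = now - event.opened_at > timeout
        if overdue and not event.acknowledged and not event.escalated:
            event.escalated = True      # escalate once instead of re-sending the event
            newly_escalated.append(event)
    return newly_escalated


events = [
    OpenEvent("db01:disk_full", opened_at=0.0),
    OpenEvent("web02:proc_down", opened_at=0.0, acknowledged=True),
    OpenEvent("app03:queue_depth", opened_at=800.0),
]
first_pass = escalate_unhandled(events, now=1000.0)
second_pass = escalate_unhandled(events, now=1100.0)
```

Only the overdue, unacknowledged event is escalated on the first pass, and the second pass returns nothing new. The console still shows one event per problem rather than a growing pile of reminders.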
This bad practice typically arises in organizations that point fingers for
unhandled problems and assign blame. Those who are responsible for
resolving problems often need to justify why they miss problems and blame
the tool for not informing them. The event management implementers, fearing
these reproaches, configure the tools to send all occurrences of a problem
rather than just one. Unfortunately, this compounds the problem because now
the support person has to handle many more events and can still miss ones
that require action.
Management needs to create an environment in which problem
postmortems are used constructively, minimizing blame and enabling