The Life of a MOM 2005 Alert

Now for the life cycle of a MOM 2005 alert. This explanation shows how MOM 2005 processes information, introduces you to the Operator console, and discusses how MOM 2005 is used on a daily basis to manage your environment.

A MOM 2005 alert tells you that something significant has happened somewhere in one of your systems. Not all alerts are created equal; they come in different levels of severity, from the benign informational and success alerts to the urgent service unavailable and critical error alerts.

Note that a MOM 2005 alert is not the same as a Windows Event log event. A Windows event is written to the event log on the server that the event occurred on and goes no further. It is specific to a service or component of that server and is, essentially, restricted to that computer.

Sometimes, a single event provides you with enough information to take action, but most of the time it doesn’t. Events do play a role in the generation of alerts, but they themselves are not alerts. In the world of MOM, think of an event as the indication of a symptom that something is wrong, not a diagnosis in and of itself. For example, if someone is sick and has a 103-degree fever, the measurement of the 103-degree fever would be like an event, as would a cough and stiff neck. This person would then go to a doctor who would consider all the symptoms before making a diagnosis (an alert) of the flu and prescribing bed rest, plenty of fluids, and pain reliever.

All alerts are generated by MOM 2005 agents (see point 1 in Figure 1-2), whether the agent is running on a monitored computer or on the management server. In the normal course of operations, agents function without administrator intervention. They receive instructions on what work to perform and how to perform it from the management packs (the health rules). At the same time, agents collect data from their host machines by monitoring the event logs and Performance Monitor counters, and by executing scripts that use the Windows Management Instrumentation (WMI) API against monitored computers. Agents then compare the collected data to values that have been predefined by Microsoft (for Microsoft-written applications) to describe health for that application. When the collected data meets the criteria in the health rules, the agent generates an alert. The alert is sent to the management server (see point 2 in Figure 1-2), recorded in the operations database, cross-correlated with other alerts to determine if a summary alert or other alert should be generated, and then surfaced in the Operator console (see point 3 in Figure 1-2). After a period of time, most alerts are groomed (deleted) out of the operations database.

Figure 1-2. Alert flow through MOM
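
The flow in Figure 1-2 can be sketched in a few lines of Python. This is only an illustrative model, not MOM 2005 code: the `Rule`, `Alert`, `evaluate`, and `receive` names, the `disk_free_pct` sample field, and the 10 percent threshold are all hypothetical, and real management pack rules are considerably richer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    computer: str
    rule_name: str
    severity: str
    description: str


@dataclass
class Rule:
    """Hypothetical stand-in for one health rule from a management pack."""
    name: str
    severity: str
    matches: Callable[[dict], bool]  # criteria applied to one collected sample


def evaluate(computer: str, sample: dict, rules: List[Rule]) -> List[Alert]:
    """Agent side: compare collected data to each rule and emit alerts."""
    alerts = []
    for rule in rules:
        if rule.matches(sample):
            alerts.append(Alert(computer, rule.name, rule.severity,
                                f"{rule.name} matched on {computer}"))
    return alerts


# Management server side: received alerts are recorded in a list that
# stands in for the operations database.
operations_db: List[Alert] = []


def receive(alerts: List[Alert]) -> None:
    operations_db.extend(alerts)


# Assumed example rule: free disk space below 10 percent is a critical error.
low_disk = Rule("Low disk space", "Critical Error",
                lambda s: s.get("disk_free_pct", 100) < 10)

receive(evaluate("homemomserver", {"disk_free_pct": 4}, [low_disk]))   # alert
receive(evaluate("homemomserver", {"disk_free_pct": 55}, [low_disk]))  # healthy
```

Only the first sample meets the rule's criteria, so only one alert reaches the stand-in database. The sketch leaves out cross-correlation and grooming.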

You will manage alerts, and almost all the other information that MOM gives you, in the Operator console. Once the alert surfaces in the Operator console, you can act on it, drill down into it for more information, and modify it by adding your own information on how it was fixed.

Figure 1-3 shows the Operator console with an alert that was created by unplugging the network cable from the computer homemomserver. The Alerts view shown here, sorted by the Time Last Modified field, is the default view for the Operator console. The whole console is patterned after Outlook 2003; some panes are intentionally hidden here for the sake of simplicity. The Operator console is discussed in detail in Chapter 6.

Figure 1-3. The heartbeat failure alert means that homemomserver is not communicating with the management server

In the Alert Details pane, MOM automatically displays the Properties tab of the alert that has been selected in the Results pane. What is displayed in the Results pane is controlled by your selections in the Navigation pane.

Depending on what is going on in your environment, and how often you clear the alerts by resolving them, the Results pane of the Alerts view may be full of alerts or have only a few. If multiple alerts are present, sort them by severity in descending order by clicking the Severity column header, which places the most severe alerts at the top. With most alerts, the first thing you want to do is read the information that the alert contains and get more information if you are not familiar with it. You do this by going through the tabs of the alert in the Details pane.

The Properties tab gives the description of the alert; in this case, “Computer HOMEMOMSERVER in domain HOMELAB may be down and does not respond to ping. The last contact time was 9/13/2005 22:40:05. Computer management mode is: Agent.” Along with the name of the alert, “MOM Agent heartbeat failure,” this tells you two things. First, the agent on the computer missed its regularly scheduled heartbeat (its last good heartbeat was at 10:40 p.m.), which could be due to any number of reasons. Second, the computer is also not responding to a ping.

Two other things to note on the Properties tab are the Resolution State and the Repeat Count fields. An alert can exist in one of several resolution states. When it first surfaces in the Operator console, it will always be in a resolution state of New unless you manually configure the rule that generated the alert with a different default state. Since this alert is now being examined and the issue that caused it is being resolved, the resolution state is changed to Acknowledged. This lets anyone else who is viewing the Operator console know that the alert has been seen and is being acted on (see Figure 1-4).

Figure 1-4. Setting the alert resolution state to Acknowledged

If additional help is needed to fix the issue, the resolution state can be updated again to one of the other values (Levels 1 through 4), which assigns the disposition of the alert to a different group. This helps you keep track of what is going on with the alert and who owns it. MOM 2005 also tracks how long an alert is in any of the resolution states. When the time spent in a resolution state exceeds a configurable limit, the alert is flagged and appears in the Service Level Exceptions Alert view. This lets you know that the person the alert was assigned to has not updated or resolved the alert in the allotted time. It also shows whether the company’s service-level agreements on problem response and resolution time are being met.
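
To make the service-level check concrete, here is a minimal Python sketch of the idea. It is not how MOM 2005 implements it: the `TrackedAlert` class, the `STATE_LIMITS` table, and the 10- and 30-minute limits are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class TrackedAlert:
    name: str
    resolution_state: str
    state_entered: datetime  # when the alert entered its current state


# Hypothetical per-state limits; in MOM 2005 these are configurable.
STATE_LIMITS: Dict[str, timedelta] = {
    "New": timedelta(minutes=10),
    "Acknowledged": timedelta(minutes=30),
}


def service_level_exceptions(alerts: List[TrackedAlert],
                             now: datetime) -> List[TrackedAlert]:
    """Return alerts that have stayed in their current state past the limit."""
    flagged = []
    for alert in alerts:
        limit = STATE_LIMITS.get(alert.resolution_state)
        if limit is not None and now - alert.state_entered > limit:
            flagged.append(alert)
    return flagged


now = datetime(2005, 9, 13, 23, 30)
alerts = [
    # In Acknowledged for 45 minutes: over the assumed 30-minute limit.
    TrackedAlert("MOM Agent heartbeat failure", "Acknowledged",
                 datetime(2005, 9, 13, 22, 45)),
    # In New for 5 minutes: within the assumed 10-minute limit.
    TrackedAlert("Low disk space", "New",
                 datetime(2005, 9, 13, 23, 25)),
]
overdue = service_level_exceptions(alerts, now)
```

Here only the heartbeat failure alert would be flagged, because it has sat in the Acknowledged state longer than its assumed limit.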

Out-of-the-box, a change in the resolution state does not kick off a workflow, like a help desk trouble ticketing application. However, since resolution state is a property of an alert, you could script a response that fires when an alert is placed into a state, such as “Level 1: Assigned to help desk or local support.” This would notify a predefined group of users (Level 1 operators) via email, pager, or another mechanism that they have been assigned an alert and are responsible for resolving it. See the "Alert notification" section in Chapter 4.

The Repeat Count field indicates how many times this specific alert has been raised in the Operator console. MOM 2005 will increment the value in this field every time a duplicate alert is generated rather than placing a new alert into the Alerts view, which could possibly generate a page or email. This saves you from being flooded with duplicate alerts. For an alert to be considered a duplicate, there must be an existing alert in the Operator console with a resolution state of anything except Resolved, and it must have been generated by the exact same rule on the exact same machine. The criteria can be further refined in the rule under Alert Suppression. This is performed in the Administrator console and is covered in Chapter 5.
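
The duplicate test described above is simple enough to sketch in Python. The `Alert` class and `raise_alert` function are hypothetical names, and starting the repeat count at 0 is an assumption; the sketch also ignores any extra criteria configured under Alert Suppression.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    rule: str
    computer: str
    resolution_state: str = "New"
    repeat_count: int = 0  # assumed starting value


def raise_alert(console: List[Alert], rule: str, computer: str) -> Alert:
    """Increment an open duplicate's repeat count, or add a new alert."""
    for existing in console:
        if (existing.rule == rule
                and existing.computer == computer
                and existing.resolution_state != "Resolved"):
            existing.repeat_count += 1
            return existing
    alert = Alert(rule, computer)
    console.append(alert)
    return alert


console: List[Alert] = []
first = raise_alert(console, "MOM Agent heartbeat failure", "homemomserver")
raise_alert(console, "MOM Agent heartbeat failure", "homemomserver")  # duplicate
first.resolution_state = "Resolved"
raise_alert(console, "MOM Agent heartbeat failure", "homemomserver")  # new alert
```

The second call only bumps the repeat count of the first alert. Once that alert is resolved, the same rule firing on the same machine produces a fresh alert.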

The next place to look for more information is on the Events tab in the Details pane. This tab lists all the Windows events that are associated with the “MOM Agent heartbeat failure” alert (see Figure 1-5).

This event actually occurred on the management server homesqlserver and has the severity of Warning. MOM 2005 uses a small lightning bolt in a yellow triangle to indicate the event associated with an alert. When you start examining the different views in the Operator console, you will see the Events view. There, you can examine all Windows events that have been collected from managed computers (no more reviewing the event logs one machine at a time) and you will notice that some events are not associated with alerts and some are.

Tip

You should be asking why this event was generated on a machine other than the one that is not communicating (homemomserver). This is because machines cannot perform heartbeat checking on themselves. If they did, then when they went down or became unavailable, the MOM agent would not be able to communicate to the outside world to notify the management server that the machine was down.
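
A rough Python sketch of this server-side check follows. It is an illustration under stated assumptions rather than MOM's actual logic: the `missed_heartbeats` function, the five-minute interval, the `fileserver01` computer, and the stubbed `fake_ping` are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List


def missed_heartbeats(last_seen: Dict[str, datetime],
                      now: datetime,
                      interval: timedelta,
                      ping: Callable[[str], bool]) -> List[str]:
    """Management server side: describe every agent whose heartbeat is overdue."""
    findings = []
    for computer, seen in sorted(last_seen.items()):
        if now - seen > interval:
            # The heartbeat is late, so check whether the machine answers at all.
            reachable = ping(computer)
            status = "responds to ping" if reachable else "does not respond to ping"
            findings.append(f"{computer}: heartbeat overdue, {status}")
    return findings


# Hypothetical last contact times recorded by the management server.
last_seen = {
    "homemomserver": datetime(2005, 9, 13, 22, 40, 5),
    "fileserver01": datetime(2005, 9, 13, 22, 49, 0),
}


def fake_ping(name: str) -> bool:
    """Stub used instead of a real network ping."""
    return name != "homemomserver"


report = missed_heartbeats(last_seen,
                           now=datetime(2005, 9, 13, 22, 50, 0),
                           interval=timedelta(minutes=5),
                           ping=fake_ping)
```

Because the check runs on the management server, it keeps working when the monitored machine is down, which is exactly the case the agent itself could never report.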

Next, drill down into the Event Details view by double-clicking the event to see what the specific event number is (see Figure 1-6).

Figure 1-5. The Windows event that triggered the “MOM Agent heartbeat failure” alert

Figure 1-6. The details of a Windows event

This event number is 21209, and you can use this information for further research if necessary. The Alerts tab displays the alerts that this event is associated with, which you already know about.

Back in the Alert Details pane, review the Product Knowledge tab. This tab is prepopulated with information from the application vendor, in this case Microsoft (see Figure 1-7).

Figure 1-7. The information on the Product Knowledge tab suggests likely causes and the resolution steps

You can skip the Summary section since you already know what this alert is about, but the Causes section reads like this:

  • The computer is unavailable.

  • The computer’s domain is unavailable.

  • The MOM Service on the computer is unavailable.

  • The agent was uninstalled from the computer.

The Resolutions section gives you various suggestions depending on the error, and reads like this:

The agent has never sent a heartbeat to the MOM Management Server.

  • Make sure the computer is available by using the Ping task in the Task pane of the MOM Operator console.

  • If the ping fails, make sure the computer still exists and that it is available on the network.

  • Make sure the MOM Service is running on the computer. You can start the MOM Service by using the Start MOM 2005 Service task in the Task pane of the MOM Operator console.

  • Try to update the agent settings by using the Update Agent Settings dialog on the Management Server.

  • Reinstall the agent.

The agent has not recently sent a heartbeat to the MOM Management Server.

  • Make sure the computer is available by using the ping command.

  • If the ping fails, make sure the computer still exists and that it is available on the network.

The agent failed to send a heartbeat within the allotted time.

  • Make sure the computer is available by using the ping command.

  • If the ping fails, make sure the computer still exists and that it is available on the network.

  • Make sure the MOM Service is running on the computer.

Going through the Causes section, you can immediately eliminate the unavailable domain and the uninstalled agent. You know that the domain is available because none of the other computers in the domain has lost communication with the domain controller. You also know that you did not uninstall the agent, either automatically, through the Administrator console, or manually. That leaves the unavailable computer and the unavailable MOM Service as causes to diagnose further. By looking back at the alert, you can see that MOM pinged homemomserver and it did not respond, so the unavailable MOM Service can be eliminated as a cause for the missing heartbeat. So, now you know that at the time the alert was generated, the server was either down or disconnected from the network.

Moving to the Resolutions section, you can also go straight to the description that most closely matches the current situation (in this case, “The agent has not recently sent a heartbeat to the MOM Management Server”) and follow the steps listed there. The first step is to attempt to ping the server again. But since the server is down, there may be other alerts being generated that are associated with homemomserver. These alerts could be flooding the console, but you can’t deal with them until the server is up. To help manage a flood of alerts from a downed machine, you can put the machine in maintenance mode. To do this, right-click the computer name in the Results pane and select Put Computer in Maintenance Mode.

When you place a managed computer into maintenance mode, all alerts generated for that specific computer are automatically resolved by the management server—they won’t surface in the Alerts view and you don’t need to deal with them. You can still examine them later because they have been captured in the operations database. When you place the server in maintenance mode, you configure the maintenance mode duration (it automatically expires on a configurable date and time) and the reason why the server is being placed in maintenance mode. This mode is especially useful when there is work planned on a server that will generate alerts, say the reboot of a server after the installation of a service pack. A small wrench icon in the column between Severity and Domain (see Figure 1-8) indicates that a computer is in maintenance mode.

Figure 1-8. homemomserver in maintenance mode
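
The effect of maintenance mode on incoming alerts can be modeled with a short Python sketch. The `MaintenanceWindow` class, the `record_alert` and `alerts_view` functions, and the sample alert name are hypothetical; the sketch shows the behavior described above, not MOM's internals.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Alert:
    computer: str
    name: str
    resolution_state: str = "New"


@dataclass
class MaintenanceWindow:
    ends: datetime  # maintenance mode expires automatically at this time
    reason: str


maintenance: Dict[str, MaintenanceWindow] = {}
operations_db: List[Alert] = []  # stand-in for the operations database


def record_alert(alert: Alert, now: datetime) -> Alert:
    """Store every alert, resolving it if its computer is in maintenance."""
    window = maintenance.get(alert.computer)
    if window is not None and now < window.ends:
        alert.resolution_state = "Resolved"
    operations_db.append(alert)
    return alert


def alerts_view() -> List[Alert]:
    """Only unresolved alerts surface in the view."""
    return [a for a in operations_db if a.resolution_state != "Resolved"]


maintenance["homemomserver"] = MaintenanceWindow(
    ends=datetime(2005, 9, 14, 1, 0), reason="Network cable unplugged")

record_alert(Alert("homemomserver", "Service unavailable"),
             now=datetime(2005, 9, 13, 23, 0))   # inside the window
record_alert(Alert("homemomserver", "Service unavailable"),
             now=datetime(2005, 9, 14, 2, 0))    # after the window expired
```

Both alerts are kept in the stand-in database, so they can be examined later, but only the one raised after the window expired appears in the view.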

Now that MOM is protected from an alert flood from this computer, you can continue with the first step suggested in the Resolutions section, pinging homemomserver.

To do this, you’ll need to open the Tasks pane in the Operator console by selecting Tasks pane from the View menu, pressing Ctrl-T, or clicking the toggle Tasks pane icon on the toolbar. The Tasks pane contains folders and leaf objects (see Figure 1-9). The leaf objects are preconfigured to execute commands against whatever machine has the focus in the Results pane. One of these functions is the ping command. This will produce the same result as opening a command prompt and running ping against homemomserver. Microsoft has included hooks into these tools in the Operator console to give you an integrated environment in which to perform your monitoring and troubleshooting tasks. You can leave the console and run the tools manually, but why do things the hard way?

Warning

When you invoke a Task object, you will be given the opportunity to enter custom command-line parameters to be passed to the tool when it executes. So, you have to know what the tool does and how to use it before clicking on it in the Operator console. Otherwise, you won’t get the desired result and you won’t know what the output means.

The folders in the Tasks pane correspond to the application management packs (the health rules) that have been imported into MOM 2005. Each management pack is preconfigured with tasks that provide you with access to the tools that are most commonly used in troubleshooting that application. Additional tasks can be created as needed in your environment through the Administrator console.

Since you can’t ping the server, there is no point in trying to make use of some of the other tasks listed, such as Remote Desktop or Computer Management, in this remote diagnosis process. These do, however, invoke the same tools that you are already familiar with: Remote Desktop is the Windows 2003 version of terminal services in administration mode, and Computer Management is the same interface you get when you right-click the My Computer icon on your desktop and select the Manage option.

Figure 1-9. Ping results continue to indicate that homemembersrvr is not accessible from the network

At this point, you have to physically check the computer and plug the disconnected network cable back in. Then you can run the ping task again and you will get a successful response. Some other alerts pop up, telling you that the network connection had been disconnected.

Now that you know what to do to resolve this issue, you should capture that information into the alert so that the next time an alert of this type occurs, information specific to your environment is available along with the product knowledge from the vendor. You will do this on the Company Knowledge tab in the Details pane (see Figure 1-10). Don’t record historical data here, only solution-specific data. After all, for future alerts it doesn’t matter that someone kicked the network cable out of the server on the night of the Christmas party. Future troubleshooters will only want to know how to fix the problem.

You can now change the state of the “MOM Agent heartbeat failure” alert to Resolved and remove homemomserver from maintenance mode. These actions will remove the alert from the Alerts view of the Operator console and allow other heartbeat failure alerts for homemomserver to surface as new alerts in the console.

Figure 1-10. Enter solution-specific information in the Company Knowledge tab
