Essential SNMP, 2nd Edition by Kevin Schmidt, Douglas Mauro

Chapter 1. Introduction to SNMP and Network Management

In today’s complex network of routers, switches, and servers, it can seem like a daunting task to manage all the devices on your network and make sure they’re not only up and running but also performing optimally. This is where the Simple Network Management Protocol (SNMP) can help. SNMP was introduced in 1988 to meet the growing need for a standard for managing Internet Protocol (IP) devices. SNMP provides its users with a “simple” set of operations that allows these devices to be managed remotely.

This book is aimed toward system administrators who would like to begin using SNMP to manage their servers or routers, but who lack the knowledge or understanding to do so. We try to give you a basic understanding of what SNMP is and how it works; beyond that, we show you how to put SNMP into practice, using a number of widely available tools. Above all, we want this to be a practical book—a book that helps you keep track of what your network is doing.

This chapter introduces SNMP, network management, and change management. Obviously, SNMP is the focus of this book, but having an understanding of general network management concepts will make you better prepared to use SNMP to manage your network.

What Is SNMP?

The core of SNMP is a simple set of operations (and the information these operations gather) that gives administrators the ability to change the state of some SNMP-based device. For example, you can use SNMP to shut down an interface on your router or check the speed at which your Ethernet interface is operating. SNMP can even monitor the temperature on your switch and warn you when it is too high.

SNMP usually is associated with managing routers, but it’s important to understand that it can be used to manage many types of devices. While SNMP’s predecessor, the Simple Gateway Management Protocol (SGMP), was developed to manage Internet routers, SNMP can be used to manage Unix systems, Windows systems, printers, modem racks, power supplies, and more. Any device running software that allows the retrieval of SNMP information can be managed. This includes not only physical devices but also software, such as web servers and databases.

Another aspect of network management is network monitoring; that is, monitoring an entire network as opposed to individual routers, hosts, and other devices. Remote Network Monitoring (RMON) was developed to help us understand how the network itself is functioning, as well as how individual devices on the network are affecting the network as a whole. It can be used to monitor not only LAN traffic, but WAN interfaces as well. We discuss RMON in more detail later in this chapter and in Chapter 2.

RFCs and SNMP Versions

The Internet Engineering Task Force (IETF) is responsible for defining the standard protocols that govern Internet traffic, including SNMP. The IETF publishes Requests for Comments (RFCs), which are specifications for many protocols that exist in the IP realm. Documents enter the standards track first as proposed standards, then move to draft status. When a final draft is eventually approved, the RFC is given standard status, although there are fewer completely approved standards than you might think. Two other designations, historical and experimental, describe (respectively) a document that has been replaced by a newer RFC and a document that is not yet ready to become a standard. The following list includes all the current SNMP versions and the IETF status of each (see Appendix D for a full list of the SNMP RFCs):

  • SNMP Version 1 (SNMPv1) is the initial version of the SNMP protocol. It’s defined in RFC 1157 and is a historical IETF standard. SNMPv1’s security is based on communities, which are nothing more than passwords: plain-text strings that allow any SNMP-based application that knows the strings to gain access to a device’s management information. There are typically three communities in SNMPv1: read-only, read-write, and trap. Note that although SNMPv1 is historical, it is still the primary SNMP implementation that many vendors support.

  • SNMP version 2 (SNMPv2) is often referred to as community-string-based SNMPv2. This version of SNMP is technically called SNMPv2c, but we will refer to it throughout this book simply as SNMPv2. It’s defined in RFC 3416, RFC 3417, and RFC 3418.

  • SNMP version 3 (SNMPv3) is the latest version of SNMP. Its main contribution to network management is security. It adds support for strong authentication and private communication between managed entities. In 2002, it finally made the transition from draft standard to full standard. The following RFCs define the standard: RFC 3410, RFC 3411, RFC 3412, RFC 3413, RFC 3414, RFC 3415, RFC 3416, RFC 3417, RFC 3418, and RFC 2576. Chapter 3 provides a thorough treatment of SNMPv3 and Chapter 6 goes through the SNMPv3 agent configuration for Net-SNMP and Cisco. Although it is good news that SNMPv3 is a full standard, vendors are notoriously slow at adopting new versions of a protocol. Even though SNMPv1 has been transitioned to historical, the vast majority of vendor implementations of SNMP are SNMPv1 implementations. Some large infrastructure vendors like Cisco have supported SNMPv3 for quite some time, and we will undoubtedly begin to see more vendors move to SNMPv3 as customers insist on more secure means of managing networks.
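
To make the idea of read-only and read-write communities concrete, here is a minimal Python sketch. It is a toy model, not a real SNMP agent: the class name, the community strings "public" and "private", and the single stored object are all assumptions chosen for illustration, and raising an exception is only a stand-in for however a real agent rejects a request.

```python
# Toy model of community-based access checks; not a real SNMP agent.
# The community strings and the stored object are illustrative assumptions.

READ_ONLY = "public"      # hypothetical read-only community string
READ_WRITE = "private"    # hypothetical read-write community string


class CommunityAgent:
    def __init__(self):
        # One managed value, keyed by a textual name.
        self.objects = {"sysLocation": "unknown"}

    def get(self, community, name):
        # Either community is enough to read a value.
        if community not in (READ_ONLY, READ_WRITE):
            raise PermissionError("unknown community")
        return self.objects[name]

    def set(self, community, name, value):
        # Only the read-write community may change a value.
        if community != READ_WRITE:
            raise PermissionError("community does not allow writes")
        self.objects[name] = value


agent = CommunityAgent()
agent.set(READ_WRITE, "sysLocation", "rack 12")
print(agent.get(READ_ONLY, "sysLocation"))  # prints: rack 12
```

Because the strings travel as plain text, anyone who learns them gets the same access, which is the weakness SNMPv3 addresses.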

The official site for RFCs is http://www.ietf.org/rfc.html. One of the biggest problems with RFCs, however, is finding the one you want. It is a little easier to navigate the RFC index at Ohio State University (http://www.cse.ohio-state.edu/cs/Services/rfc/index.html).

Managers and Agents

In the previous sections, we’ve vaguely referred to SNMP-capable devices and network management stations. Now it’s time to describe what these two things really are. In the world of SNMP, there are two kinds of entities: managers and agents. A manager is a server running some kind of software system that can handle management tasks for a network. Managers are often referred to as Network Management Stations (NMSs).[*] An NMS is responsible for polling and receiving traps from agents in the network. A poll, in the context of network management, is the act of querying an agent (router, switch, Unix server, etc.) for some piece of information. This information can be used later to determine if some sort of catastrophic event has occurred. A trap is a way for the agent to tell the NMS that something has happened. Traps are sent asynchronously, not in response to queries from the NMS. The NMS is further responsible for performing an action[] based upon the information it receives from the agent. For example, when your T1 circuit to the Internet goes down, your router can send a trap to your NMS. In turn, the NMS can take some action, perhaps paging you to let you know that something has happened.

The second entity, the agent, is a piece of software that runs on the network devices you are managing. It can be a separate program (a daemon, in Unix language), or it can be incorporated into the operating system (for example, Cisco’s IOS on a router, or the low-level operating system that controls a UPS). Today, most IP devices come with some kind of SNMP agent built in. The fact that vendors are willing to implement agents in many of their products makes the system administrator’s or network manager’s job easier. The agent provides management information to the NMS by keeping track of various operational aspects of the device. For example, the agent on a router is able to keep track of the state of each of its interfaces: which ones are up, which ones are down, etc. The NMS can query the status of each interface and take appropriate action if any of them are down. When the agent notices that something bad has happened, it can send a trap to the NMS. This trap originates from the agent and is sent to the NMS, where it is handled appropriately. Some devices will send a corresponding “all clear” trap when there is a transition from a bad state to a good state. This can be useful in determining when a problem situation has been resolved. Figure 1-1 shows the relationship between the NMS and an agent.

Figure 1-1. Relationship between an NMS and an agent

It’s important to keep in mind that polls and traps can happen at the same time. There are no restrictions on when the NMS can query the agent or when the agent can send a trap.
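
The difference between a poll and a trap can be sketched in a few lines of Python. This is an in-memory toy with no network traffic; the class names, the interface name Serial0, and the callback used to deliver traps are all hypothetical.

```python
# Toy in-memory model of polling and traps; no network traffic is involved.

class ToyAgent:
    def __init__(self, name, interfaces):
        self.name = name
        self.interfaces = dict(interfaces)   # interface name -> "up" or "down"
        self.trap_sink = None                # callable that receives traps

    def handle_poll(self, ifname):
        # Answer a query that the NMS initiated.
        return self.interfaces[ifname]

    def set_status(self, ifname, status):
        # A state change produces an unsolicited trap if a sink is configured.
        old = self.interfaces[ifname]
        self.interfaces[ifname] = status
        if status != old and self.trap_sink is not None:
            self.trap_sink(self.name, ifname, status)


class ToyNMS:
    def __init__(self):
        self.events = []                     # traps received so far

    def poll(self, agent, ifname):
        return agent.handle_poll(ifname)

    def receive_trap(self, agent_name, ifname, status):
        self.events.append((agent_name, ifname, status))


nms = ToyNMS()
router = ToyAgent("router1", {"Serial0": "up"})
router.trap_sink = nms.receive_trap

print(nms.poll(router, "Serial0"))      # NMS-initiated query, prints: up
router.set_status("Serial0", "down")    # agent-initiated trap
print(nms.events)                       # prints: [('router1', 'Serial0', 'down')]
```

Setting the interface back to "up" would produce a second trap, which corresponds to the "all clear" notification some devices send.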

The Structure of Management Information and MIBs

The Structure of Management Information (SMI) provides a way to define managed objects and their behavior. An agent has in its possession a list of the objects that it tracks. One such object is the operational status of a router interface (for example, up, down, or testing). This list collectively defines the information the NMS can use to determine the overall health of the device on which the agent resides.

The Management Information Base (MIB) can be thought of as a database of managed objects that the agent tracks. Any sort of status or statistical information that can be accessed by the NMS is defined in a MIB. The SMI provides a way to define managed objects while the MIB is the definition (using the SMI syntax) of the objects themselves. Like a dictionary, which shows how to spell a word and then gives its meaning or definition, a MIB defines a textual name for a managed object and explains its meaning. Chapter 2 goes into more technical detail about MIBs and the SMI.

An agent may implement many MIBs, but all agents implement a particular MIB called MIB-II[*] (RFC 1213). This standard defines variables for things such as interface statistics (interface speeds, MTU, octets[*] sent, octets received, etc.) as well as various other things pertaining to the system itself (system location, system contact, etc.). The main goal of MIB-II is to provide general TCP/IP management information. It doesn’t cover every possible item a vendor may want to manage within its particular device.
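
The dictionary analogy can be illustrated with a small Python lookup table. The object names and numeric identifiers below are taken from MIB-II (RFC 1213); the one-line meanings are paraphrases rather than the RFC's wording, and the table and function names are hypothetical. Object identifiers themselves are covered in Chapter 2.

```python
# A tiny excerpt of MIB-II (RFC 1213) as a lookup table.
# Each textual name maps to its object identifier and a paraphrased meaning.
MIB_II_EXCERPT = {
    "sysContact": ("1.3.6.1.2.1.1.4", "contact person for the managed node"),
    "sysLocation": ("1.3.6.1.2.1.1.6", "physical location of the node"),
    "ifMtu": ("1.3.6.1.2.1.2.2.1.4", "largest datagram the interface handles, in octets"),
    "ifSpeed": ("1.3.6.1.2.1.2.2.1.5", "estimated interface bandwidth, in bits per second"),
    "ifInOctets": ("1.3.6.1.2.1.2.2.1.10", "total octets received on the interface"),
    "ifOutOctets": ("1.3.6.1.2.1.2.2.1.16", "total octets sent out of the interface"),
}


def describe(name):
    # Look a managed object up by its textual name, as in a dictionary.
    oid, meaning = MIB_II_EXCERPT[name]
    return f"{name} ({oid}): {meaning}"


print(describe("ifSpeed"))
```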

What other kinds of information might be useful to collect? First, many draft and proposed standards have been developed to help manage things such as frame relay, ATM, FDDI, and services (mail, Domain Name System (DNS), etc.). A sampling of these MIBs and their RFC numbers includes:

  • ATM MIB (RFC 2515)

  • Frame Relay DTE Interface Type MIB (RFC 2115)

  • BGP Version 4 MIB (RFC 1657)

  • RDBMS MIB (RFC 1697)

  • RADIUS Authentication Server MIB (RFC 2619)

  • Mail Monitoring MIB (RFC 2789)

  • DNS Server MIB (RFC 1611)

But that’s far from the entire story, which is why vendors, and individuals, are allowed to define MIB variables for their own use.[] For example, consider a vendor that is bringing a new router to market. The agent built into the router will respond to NMS requests (or send traps to the NMS) for the variables defined by the MIB-II standard; it probably also implements MIBs for the interface types it provides (e.g., RFC 2515 for ATM and RFC 2115 for Frame Relay). In addition, the router may have some significant new features that are worth monitoring but are not covered by any standard MIB. So, the vendor defines its own MIB (sometimes referred to as a proprietary MIB) that implements managed objects for the status and statistical information of its new router.

Tip

Simply loading a new MIB into your NMS does not necessarily allow you to retrieve the objects and values defined within that MIB; the agent you are querying must support it as well. You need to load only those MIBs supported by the agents you are querying (e.g., with snmpget or snmpwalk). Feel free to load additional MIBs for future device support, but don’t panic when your device doesn’t answer (and possibly returns errors for) queries against these unsupported MIBs.

Host Management

Managing host resources (disk space, memory usage, etc.) is an important part of network management. The distinction between traditional system administration and network management has been disappearing over the last decade and is now all but gone. As Sun Microsystems puts it, “The network is the computer.” If your web server or mail server is down, it doesn’t matter whether your routers are running correctly—you’re still going to get calls. The Host Resources MIB (RFC 2790) defines a set of objects to help manage critical aspects of Unix and Windows systems.[*]

Some of the objects supported by the Host Resources MIB include disk capacity, number of system users, number of running processes, and software currently installed. Today, more and more people are relying on service-oriented web sites. Making sure your backend servers are functioning properly is as important as monitoring your routers and other communications devices.

Unfortunately, some agent implementations for these platforms do not implement this MIB since it’s not required.

A Brief Introduction to Remote Monitoring (RMON)

Remote Monitoring Version 1 (RMONv1, or RMON) is defined in RFC 2819; an enhanced version of the standard, called RMON Version 2 (RMONv2), is defined in RFC 2021. RMONv1 provides the NMS with packet-level statistics about an entire LAN or WAN. RMONv2 builds on RMONv1 by providing network- and application-level statistics. These statistics can be gathered in several ways. One way is to place an RMON probe on every network segment you want to monitor. Some Cisco routers have limited RMON capabilities built in, so you can use their functionality to perform minor RMON duties. Likewise, some 3Com switches implement the full RMON specification and can be used as full-blown RMON probes.

The RMON MIB was designed to allow an actual RMON probe to run in an offline mode that allows the probe to gather statistics about the network it’s watching without requiring an NMS to query it constantly. At some later time, the NMS can query the probe for the statistics it has been gathering. Another feature that most probes implement is the ability to set thresholds for various error conditions and, when a threshold is crossed, alert the NMS with an SNMP trap. You can find a little more technical detail about RMON in the next chapter.
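
The offline collection and threshold behavior described above can be sketched as follows. This is a toy model, not the RMON MIB itself: the class name, the error counts, and the use of a plain callback in place of an SNMP trap are assumptions made for illustration.

```python
# Toy model of a probe that stores statistics locally and alerts on a
# threshold crossing. A callback stands in for the SNMP trap.

class ToyProbe:
    def __init__(self, threshold, send_trap):
        self.threshold = threshold
        self.send_trap = send_trap   # callable standing in for an SNMP trap
        self.samples = []            # statistics kept on the probe itself
        self.above = False           # was the previous sample over the threshold?

    def record(self, errors_per_interval):
        self.samples.append(errors_per_interval)
        now_above = errors_per_interval > self.threshold
        # Alert only when the threshold is crossed, not on every high sample.
        if now_above and not self.above:
            self.send_trap(errors_per_interval)
        self.above = now_above

    def collect(self):
        # Called by the NMS at some later time to fetch the stored statistics.
        return list(self.samples)


traps = []
probe = ToyProbe(threshold=10, send_trap=traps.append)
for value in [2, 4, 15, 20, 3, 12]:
    probe.record(value)

print(traps)            # prints: [15, 12]
print(probe.collect())  # prints: [2, 4, 15, 20, 3, 12]
```

The sample of 20 does not raise a second alert because the value was already above the threshold; only the two crossings do.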

The Concept of Network Management

SNMP is really about network management. Network management is a discipline of its own, but before learning about the details of SNMP in Chapter 2, it’s helpful to have an overview of network management itself.

What is network management? Network management is a general concept that employs the use of various tools, techniques, and systems to aid human beings in managing various devices, systems, or networks. Let’s take SNMP out of the picture right now and look at a model for network management called FCAPS, or Fault Management, Configuration Management, Accounting Management, Performance Management, and Security Management. These conceptual areas were created by the International Organization for Standardization (ISO) to aid in the understanding of the major functions of network management systems. Let’s briefly look at each of these now.

Fault Management

The goal of fault management is to detect and log problems and to notify the users of the affected systems or networks. In many environments, downtime of any kind is not acceptable.

Fault management dictates that these steps for fault resolution be followed:

  1. Isolate the problem by using tools to determine symptoms.

  2. Resolve the problem.

  3. Record the process that was used to detect and resolve the problem.

While step 3 is important, it is often skipped. Neglecting step 3 has the unwanted effect of causing new engineers to follow steps 1 and 2 in the dark when they could have consulted a database of troubleshooting tips.

Configuration Management

The goal of configuration management is to monitor network and system configuration information so that the effects on network operation of various versions of hardware and software elements can be tracked and managed.

Any system may have a number of interesting and pertinent configuration parameters that engineers may be interested in capturing, including:

  • Version of operating system, firmware, etc.

  • Number of network interfaces and speeds, etc.

  • Number of hard disks

  • Number of CPUs

  • Amount of RAM

This information generally is stored in a database of some kind. As configuration parameters change for systems, this database is updated. An added benefit to having this data store is that it can aid in problem resolution.
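
As a rough illustration of how such a data store can aid problem resolution, the Python sketch below compares a stored configuration record with a freshly gathered one and reports what changed. The field names and values are invented for the example, and a plain dictionary stands in for the database.

```python
# Compare a stored configuration record with the current one.
# Field names and values are made up for illustration.

def config_changes(old, new):
    # Return {parameter: (before, after)} for every parameter that differs.
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


stored = {"os_version": "12.2", "cpus": 2, "ram_mb": 4096}
current = {"os_version": "12.4", "cpus": 2, "ram_mb": 8192}

print(config_changes(stored, current))
# prints: {'os_version': ('12.2', '12.4'), 'ram_mb': (4096, 8192)}
```

If a problem appears shortly after a change like the OS upgrade shown here, the recorded difference is a natural first place to look.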

Accounting Management

The goal of accounting management is to ensure that computing and network resources are used fairly by all groups or individuals who access them. Through this form of regulation, network problems can be minimized since resources are divided based on capacities.

Performance Management

The goal of performance management is to measure and report on various aspects of network or system performance.

Let’s look at the steps involved in performance management:

  1. Performance data is gathered.

  2. Baseline levels are established based on analysis of the data gathered.

  3. Performance thresholds are established. When these thresholds are exceeded, it is indicative of a problem that requires attention.
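
The three steps above can be sketched in Python. The choice of "mean plus three standard deviations" as the threshold is an assumption made for this example, not a rule from the text, and the response-time samples are invented.

```python
# Sketch of the three steps: gather data, establish a baseline, set a threshold.
from statistics import mean, stdev


def build_threshold(samples, k=3.0):
    # Step 2: the baseline is the average of the gathered data.
    baseline = mean(samples)
    # Step 3: the threshold sits k sample standard deviations above it
    # (an assumed policy; pick whatever suits your environment).
    threshold = baseline + k * stdev(samples)
    return baseline, threshold


def exceeds(value, threshold):
    return value > threshold


# Step 1: gathered response times in milliseconds (made-up values).
history_ms = [100, 102, 98, 101, 99]
baseline, threshold = build_threshold(history_ms)

print(baseline)                 # prints: 100
print(round(threshold, 1))      # prints: 104.7
print(exceeds(150, threshold))  # prints: True
```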

One example of performance management is service monitoring. For example, an Internet service provider (ISP) may be interested in monitoring its email service response time. This includes sending emails via SMTP and getting email via POP3. See Chapter 11 for examples of how to do this.

Security Management

The goal of security management is twofold. First, we wish to control access to some resource, such as a network and its hosts. Second, we wish to help detect and prevent attacks that can compromise networks and hosts. Attacks against networks and hosts can lead to denial of service and, even worse, allow hackers to gain access to vital systems that contain accounting, payroll, and source code data.

Security management encompasses not only network security systems but also physical security. Physical security includes card access and video surveillance systems. The goal here is to ensure that only authorized individuals have physical access to vulnerable systems.

Today, network security management is accomplished through the use of various tools and systems designed specifically for this purpose. These include:

  • Firewalls

  • Intrusion Detection Systems (IDSs)

  • Intrusion Prevention Systems (IPSs)

  • Antivirus systems

  • Policy management and enforcement systems

Most if not all of today’s network security systems can integrate with network management systems via SNMP.

Applying the Concepts of Network Management

Being able to apply the concepts of network management is as important as learning how to use SNMP. This section of the chapter provides insights into some of the issues surrounding network management.

Business Case Requirements

The endeavor of network management involves solving a business problem through an implementation of some sort. A business case is developed to understand the impact of implementing some sort of task or function. It looks at how, for example, network administrators do their day-to-day jobs. The basic idea is to reduce costs and increase effectiveness. If the implementation doesn’t save a company any money while providing more effective services, there is almost no need to implement a given solution.

Levels of Activity

Before applying management to a specific service or device, you must understand the four possible levels of activity and decide what is appropriate for that service or device:

Inactive

No monitoring is being done, and, if you did receive an alarm in this area, you would ignore it.

Reactive

No monitoring is being done; you react to a problem if it occurs.

Interactive

You monitor components but must interactively troubleshoot them to eliminate side-effect alarms and isolate a root cause.

Proactive

You monitor components, and the system provides a root-cause alarm for the problem at hand and initiates predefined automatic restoral processes where possible to minimize downtime.

Reporting of Trend Analysis

The ability to monitor a service or system proactively begins with trend analysis and reporting. Chapters 12 and 13 describe two tools that are capable of aiding in trend reporting. In general, the goal of trend analysis is to identify when systems, services, or networks are beginning to reach their maximum capacity, with enough lead time to do something about it before it becomes a real problem for end users. For example, you may discover a need to add more memory to your database server or upgrade to a newer version of some application server software that adds a performance boost. Doing so before it becomes a real problem can help your users avoid frustration and possibly keep you employed.
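
One simple way to estimate lead time is to fit a straight line to past measurements and project when it reaches capacity. The Python sketch below does this with an ordinary least-squares fit; the linear-growth assumption, the function names, and the sample data are all illustrative, and real usage rarely grows this neatly.

```python
# Project when a resource reaches capacity by fitting a straight line.
# Assumes roughly linear growth, which is a simplification.

def fit_line(xs, ys):
    # Ordinary least-squares fit of y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_percent, capacity=100.0):
    slope, intercept = fit_line(days, used_percent)
    if slope <= 0:
        return None   # usage is flat or shrinking; nothing to project
    full_day = (capacity - intercept) / slope
    return full_day - days[-1]


days = [0, 1, 2, 3, 4]
used = [50.0, 52.0, 54.0, 56.0, 58.0]   # made-up disk usage, in percent

print(days_until_full(days, used))  # prints: 21.0
```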

Response Time Reporting

If you are responsible for managing any sort of server (HTTP, SMTP, etc.), you know how frustrating it can be when users come knocking on your door to say that the web server is slow or that surfing the Internet is slow. Response time reporting measures how various aspects of your network (including systems) are performing with respect to responsiveness. Chapter 11 shows how to monitor services with SNMP.

Alarm Correlation

Alarm correlation deals with narrowing down many alerts and events into a single alert or several events that depict the real problem. Another name for this is root-cause analysis. The idea is simple, but it tends to be difficult in practice. For example, when a web server on your network goes down, and you are managing all devices between you and the server (including the switch the server is on and the router), you may get any number of alerts including ones for the server being down, the switch being down, or the router being down, depending on where the real failure is.

Let’s say the router is the real issue (for example, an interface card died). You really only need to know that the router is down. Network management systems can often detect when some device or network is unreachable due to varying reasons. The key in this situation is to correlate the server, switch, and router down events into a single high-level event detailing that the router is down. This high-level event can be made up of all the entities and their alarms that are affected by the router being down, but you want to shield an operator from all of these until he is interested in looking at them. The real problem that needs to be addressed is the router’s failure. Keeping this storm of alerts and alarms away from the operator helps with overall efficiency and improves the trouble resolution capabilities of the staff.
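
The correlation step can be sketched in Python using the router, switch, and server from the example. The topology table and function names are hypothetical, and the rule used here (blame the furthest-upstream device in an unbroken chain of failures) is one simple approach among many.

```python
# Group "down" alarms under the furthest-upstream device that is also down.
# Hypothetical topology: each device maps to the device it is reached through.
UPSTREAM = {
    "router": None,        # reached directly from the NMS
    "switch": "router",
    "webserver": "switch",
}


def root_cause_of(device, down):
    # Walk toward the NMS while each upstream device is also reported down.
    cause = device
    parent = UPSTREAM[device]
    while parent is not None and parent in down:
        cause = parent
        parent = UPSTREAM[parent]
    return cause


def correlate(down):
    # Build one high-level event per root cause, listing the affected devices.
    events = {}
    for device in sorted(down):
        events.setdefault(root_cause_of(device, down), []).append(device)
    return events


print(correlate({"router", "switch", "webserver"}))
# prints: {'router': ['router', 'switch', 'webserver']}
```

The operator sees a single "router" event; the switch and server alarms stay attached to it for anyone who wants the detail.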

Clearing alarms is also important. For example, once the router is back up and running, presumably it’s going to send an SNMP message that it has come back to life, or maybe a network management system will discover that it’s back up and create an alarm to this effect. This notion of state transition, from bad to good, is common. It helps operators know that something is indeed up and operational. It also helps with trending. If you see that a certain device is constantly unreliable, you may want to investigate why.

Trouble Resolution

The key to trouble resolution is knowing that what you are looking at is valuable and can help you resolve the problem. As such, alarms and alerts should aid an operator in resolving the problem. For example, when your router goes down, a cryptic message like “router down” is not helpful. If possible, alerts and alarms should provide the operator with enough detail so that she can effectively troubleshoot and resolve the problem.

Change Management

Change management deals with, well, managing change. In other words, you need to plan for both scheduled and emergency changes to your network. Not doing so can cause networks and systems to be unreliable at best and can upset the very people you work for at worst. The following sections provide a high-level overview of change management techniques recommended by Cisco. See the end of this section for the URL to the Cisco paper and others on the topic of network management.

Planning for Change

Change planning is a process that identifies the risk level of a change and builds change planning requirements to ensure that the change is successful. The key steps for change planning are as follows:

  • Assign all potential changes a risk level prior to scheduling the change.

  • Document at least three risk levels with corresponding change planning requirements. Identify risk levels for software and hardware upgrades, topology changes, routing changes, configuration changes, and new deployments. Assign higher risk levels to nonstandard add, move, or change types of activity.

  • The high-risk change process you document needs to include lab validation, vendor review, peer review, and detailed configuration and design documentation.

  • Create solution templates for deployments affecting multiple sites. Include information about physical layout, logical design, configuration, software versions, acceptable hardware chassis and modules, and deployment guidelines.

  • Document your network standards for configuration, software version, supported hardware, and DNS. Additionally, you may need to document things like device naming conventions, network design details, and services supported throughout the network.

Managing Change

Change management is a process that approves and schedules the change to ensure the correct level of notification with minimal user impact. The key steps for change management are as follows:

  • Assign a change controller who can run change management review meetings, receive and review change requests, manage change process improvements, and act as a liaison for user groups.

  • Hold periodic change review meetings with system administration, application development, network operations, and facilities groups as well as general users.

  • Document change input requirements, including change owner, business impact, risk level, reason for change, success factors, backout plan, and testing requirements.

  • Document change output requirements, including updates to DNS, network map, template, IP addressing, circuit management, and network management.

  • Define a change approval process that verifies validation steps for higher-risk change.

  • Hold postmortem meetings for unsuccessful changes to determine the root cause of change failure.

  • Develop an emergency change procedure that ensures that an optimal solution is maintained or quickly restored.

High-Level Process Flow for Planned Change Management

The steps you’ll need to follow during a network change are represented in Figure 1-2.[*] The following sections briefly discuss each box in the flow.

Scope

Scope is the who, what, where, and how for the change. In other words, you need to detail every possible impact point for the change, especially its impact on people.

Risk assessment

Everything you do to or on a network, when it comes to change, has an associated risk. The person requesting the change needs to establish the risk level for the change. It is best to experiment in a lab setting if you can before you go live with a change. This can help identify problems and aid in risk evaluation.

Figure 1-2. Process flow for planned change management

Test and validation

With any proposed change, you want to make sure you have all of your bases covered. Rigorous testing and validation can help with this. Depending upon the associated risk, various levels of validation may need to be performed. For example, if the change has the potential to impact a great many systems, you may wish to test the change in a lab setting. If the change doesn’t work, you may also need to document backout procedures.

Change planning

For a change to be successful, you must plan for it. This includes requirements gathering, ordering software or hardware, creating documentation, and coordinating human resources.

Change controller

Basically, a change controller is a person who is responsible for coordinating all details of the change process.

Change management team

You should create a change management team that includes representation from network operations, server operations, application support, and user groups within your organization. The team should review all change requests and approve or deny each request based on completeness, readiness, business impact, business need, and any other conflicts.

Tip

The change management team does not investigate the technical accuracy of the change; technical experts who better understand the scope and technical details should complete this phase of the change process.

Communication

Many organizations, even small ones, fail to communicate their intentions. Make sure you keep people who may be affected up-to-date on the status of the changes.

Implementation team

You should create an implementation team consisting of individuals with the technical expertise to expedite a change. The implementation team should also be involved in the planning phase to contribute to the development of the project checkpoints, testing, backout criteria, and backout time constraints. This team should guarantee adherence to organizational standards, update DNS and network management tools, and maintain and enhance the tool set used to test and validate the change.

Test evaluation of change

Once the change has been made, you should begin testing it. Hopefully you already have a set of tests documented that can be used to validate the change. Make sure you allow yourself enough time to perform the tests. If you must back out the change, make sure you test this scenario, too.

Network management update

Be sure to update any systems like network management tools, device configurations, network configurations, DNS entries, etc., to reflect the change. This may include removing devices from the management systems that no longer exist, changing the SNMP trap destination your routers use, and so forth.

Documentation

Always update documentation that becomes obsolete or incorrect when a change occurs. Documentation may end up being used by a network administrator to solve a problem. If it isn’t up-to-date, he cannot be effective in his duties.

High-Level Process Flow for Emergency Change Management

In the real world, change often comes at 2 a.m. when some critical system is down. But with some effort, your on-the-fly change doesn’t have to cause heartburn for you and others in the company. Documentation means a lot more during emergency changes than it does in planned changes. In the heat of the moment, things can get lost or forgotten. Accurately recording the steps and procedures taken will ensure that troubles can be resolved in the future. If you have to, take short notes while the process is unfolding. Later, write it up formally; the important thing is to remember to do it.

Figure 1-3 shows the process flow for emergency changes.[*]

Figure 1-3. Emergency change process

Issue determination

What needs to change is generally not difficult to determine in an emergency. The key is to take one step at a time and not rush things. Yes, time is critical, but rushing can cause mistakes to be made or even bring about a resolution that doesn’t fix the real issue. In some cases, the outage can be unnecessarily prolonged.

Limited risk assessment

Risk assessment is performed by the network administrator on duty, with advice from other support personnel. Her experience will guide her in how the change is classified from a risk perspective. For example, changing the version of software on a router has much greater impact than changing a device’s IP address.

Communication and documentation

If at all possible, users should be notified of the change, although in an emergency that isn’t always possible. Also, be sure to communicate any changes to the change manager, who will want to add them to any metrics kept on changes. Ensuring that documentation is up to date cannot be stressed enough. Out-of-date documentation means that the staff cannot accurately troubleshoot network and systems problems in the future.

Implementation

If risk assessment and documentation are completed before implementation, the actual implementation should be straightforward. Beware of several support personnel making changes at the same time without knowing about each other’s work. This scenario can increase downtime and lead to misinterpretation of the problem.

Test and evaluation

Be sure to test the change. Generally, the person who implemented the change also tests and evaluates it. The primary goal is to determine whether the change had the desired effect. If it did not, the emergency change process must be restarted.

Before and After SNMP

Now that you have an idea about what SNMP and network management are, we should look at the before and after pictures for implementing these concepts and technologies. Let’s say that you have a network of 100 machines running various operating systems. Several machines are fileservers, a few others are print servers, another is running software that verifies credit card transactions (presumably from a web-based ordering system), and the rest are personal workstations. In addition, various switches and routers help keep the network going. A T1 circuit connects the company to the Internet, and a private connection runs to the credit card verification system.

What happens when one of the fileservers crashes? If it happens in the middle of the workweek, the people using it will notice and the appropriate administrator will be called to fix it. But what if it happens after everyone has gone home, including the administrators, or over the weekend?

What if the private connection to the credit card verification system goes down at 10 p.m. on Friday and isn’t restored until Monday morning? If the problem was faulty hardware and it could have been fixed by swapping out a card or replacing a router, thousands of dollars in web site sales could have been lost for no reason. Likewise, if the T1 circuit to the Internet goes down, it could adversely affect the amount of sales generated by individuals accessing your web site and placing orders.

These are obviously serious problems—problems that can conceivably affect the survival of your business. This is where SNMP comes in. Instead of waiting for someone to notice that something is wrong and locate the person responsible for fixing the problem (which may not happen until Monday morning, if the problem occurs over the weekend), SNMP allows you to monitor your network constantly, even when you’re not there. For example, it will notice if the number of bad packets coming through one of your router’s interfaces is gradually increasing, suggesting that the router is about to fail. You can arrange to be notified automatically when failure seems imminent so that you can fix the router before it actually breaks. You can also arrange to be notified if the credit card processor appears to get hung—you may even be able to fix it from home. And if nothing goes wrong, you can return to the office on Monday morning knowing there won’t be any surprises.
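As a rough illustration of the "gradually increasing bad packets" check described above, the following Python sketch looks at a series of readings from an interface error counter and reports whether the per-interval error count keeps growing. It is only a sketch under stated assumptions: the function names and the three-interval threshold are hypothetical, the readings are assumed to come from a 32-bit SNMP counter (for instance, ifInErrors polled at a fixed interval), and the polling itself is not shown.

```python
from typing import List, Sequence

# Assumption: the readings come from a 32-bit SNMP counter, which wraps to zero
# after reaching 2**32 - 1.
COUNTER32_MODULUS = 2 ** 32


def counter_deltas(samples: Sequence[int], modulus: int = COUNTER32_MODULUS) -> List[int]:
    """Return the increase between each pair of successive readings.

    Taking the difference modulo the counter size keeps the result correct
    when the counter wraps once between two polls.
    """
    return [(later - earlier) % modulus for earlier, later in zip(samples, samples[1:])]


def errors_are_rising(samples: Sequence[int], min_intervals: int = 3) -> bool:
    """Return True if the per-interval error count grew min_intervals times in a row.

    The threshold is a hypothetical tuning value, not something SNMP defines.
    """
    deltas = counter_deltas(samples)
    if len(deltas) < min_intervals + 1:
        return False  # not enough history to judge a trend
    recent = deltas[-(min_intervals + 1):]
    return all(newer > older for older, newer in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical readings taken at equal polling intervals.
    readings = [100, 102, 105, 110, 120]
    if errors_are_rising(readings):
        print("Error rate is climbing; notify the administrator on call.")
```

In practice an NMS would do this kind of comparison for you; the point is only that a trend in a counter, rather than a single reading, is what signals trouble before the device actually fails.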

There might not be quite as much glory in fixing problems before they occur, but you and your management will rest more easily. We can’t tell you how to translate that into a higher salary—sometimes it’s better to be the guy who rushes in and fixes things in the middle of a crisis, rather than the guy who makes sure the crisis never occurs. But SNMP does enable you to keep logs that prove your network is running reliably and show when you took action to avert an impending crisis.

Staffing Considerations

Implementing a network management system can mean adding more staff to handle the increased load of maintaining and operating such an environment. At the same time, adding this type of monitoring should, in most cases, reduce the workload of your system administration staff. You will need:

  • Staff to maintain the management station. This includes ensuring the management station is configured to properly handle events from SNMP-capable devices.

  • Staff to maintain the SNMP-capable devices. This includes making sure that workstations and servers can communicate with the management station.

  • Staff to watch and fix the network. This group is usually called a Network Operations Center (NOC) and is staffed 24/7. An alternative to 24/7 staffing is to implement rotating pager duty, where one person is on call at all times, but not necessarily present in the office. Pager duty works only in smaller networked environments in which a network outage can wait for someone to drive into the office and fix the problem.

There is no way to predetermine how many staff members you will need to maintain a management system. The size of the staff will vary depending on the size and complexity of the network you’re managing. Some of the larger Internet backbone providers have 70 or more people in their NOCs and others have only one.

Getting More Information

Getting a handle on SNMP may seem like a daunting task. The RFCs provide the official definition of the protocol, but they were written for software developers, not network administrators, so it can be difficult to extract the information you need from them. Fortunately, many online resources are available. A good place to look is the SimpleWeb (http://www.simpleweb.org). SNMP Link (http://www.SNMPLink.org) is another good site for information. The Simple Times, an online publication devoted to SNMP and network management, is also useful. You can find all the issues ever published[*] at http://www.simple-times.org. SNMP Research is a commercial SNMP vendor. Aside from selling advanced SNMP solutions, its web site contains a good amount of free information about SNMP. The company’s web site is http://www.snmp.com.

Another great resource is Usenet news. The newsgroup most people frequent is comp.dcom.net-management. Another good newsgroup is comp.protocols.snmp. Groups such as these promote a community of information sharing, allowing seasoned professionals to interact with individuals who are not as knowledgeable about SNMP or network management. Google has a great interface for searching Usenet newsgroups at http://groups.google.com.

There is an SNMP FAQ, available in two parts at http://www.faqs.org/faqs/snmp-faq/part1/ and http://www.faqs.org/faqs/snmp-faq/part2/.

Cisco has some very good papers on network management, including “Network Management Basics” (http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/nmbasics.htm) and “Change Management,” from which Figure 1-2 and Figure 1-3 were drawn. Also, Douglas W. Stevenson’s article, “Network Management: What It Is and What It Isn’t,” available at http://www.itmweb.com/essay516.htm, provides important background material for all students of network management.

With that background in mind, Chapter 2 delves much deeper into the details of SNMP.



[*] See Appendix F for a listing of some popular NMS applications.

[] Note that the NMS is preconfigured to perform this action.

[*] MIB-I is the original version of this MIB, but it is no longer referred to since MIB-II enhances it.

[*] An octet is an 8-bit quantity, which is the fundamental unit of transfer in TCP/IP networks.

[] This topic is discussed further in the next chapter.

[*] Any operating system running an SNMP agent can implement Host Resources; it’s not confined to agents running on Unix and Windows systems.

[*] Reprinted by permission from Cisco’s “Change Management: Best Practices White Paper,” Document ID 22852, http://www.cisco.com/warp/public/126/chmgmt.shtml.

[*] Reprinted by permission from Cisco’s “Change Management: Best Practices White Paper,” Document ID 22852, http://www.cisco.com/warp/public/126/chmgmt.shtml.

[*] At this writing, the current issue is quite old, published in December 2002.
