Now that you understand the basic concepts behind how network management stations (NMSs) and agents communicate, it's time to introduce the concept of a network management architecture. Before rushing out to deploy SNMP management, you owe it to yourself to put some effort into developing a coherent plan. If you simply drop NMS software on a few of your favorite desktop machines, you're likely to end up with something that doesn't work very well. By NMS architecture, we mean a plan that helps you use NMSs effectively to manage your network. A key component of network management is selecting the proper hardware (i.e., an appropriate platform on which to run your NMS) and making sure that your management stations are located in such a way that they can observe the devices on your network effectively.
Managing a reasonably large network requires an NMS with substantial computing power. In today's complex networked environments, networks can range in size from a few nodes to thousands of nodes. The process of polling and receiving traps from hundreds or thousands of managed entities can be taxing on the best of hardware. Your NMS vendor will be able to help you determine what kind of hardware is appropriate for managing your network. Most vendors have formulas for determining how much RAM you will need to achieve the level of performance you want, given the requirements of your network. It usually boils down to the number of devices you want to poll, the amount of information you will request from each device, and the interval at which you want to poll them. The software you want to run is also a consideration. NMS products such as OpenView are large, heavyweight applications; if you want to run your own scripts with Perl, you can get away with a much smaller management platform.
Is it possible to say something more helpful than "ask your vendor"? Yes. First, although we've become accustomed to thinking of NMS software as requiring a midrange workstation or high-end PC, desktop hardware has advanced so much in the past year or two that running this software is within the range of any modern PC. Specifically, surveying the recommendations of a number of vendors, we have found that they suggest a PC with at least a 2 or 3 GHz CPU, 512 MB to 1 GB of memory, and 1-2 GB of disk space. Requirements for Sun SPARC and HP workstations are similar.
Let's look at each of these requirements:
2 or 3 GHz CPU: This is well within the range of any modern desktop system, but you probably can't bring your older equipment out of retirement to use as a management station.
512 MB to 1 GB of memory: You'll probably have to add memory to any off-the-shelf PC; Sun and HP workstations come with more generous memory configurations. Frankly, vendors tend to underestimate memory requirements anyway, so it won't hurt to upgrade to 2 GB. Fortunately, RAM is usually cheap these days, though memory prices fluctuate from day to day.
1-2 GB of disk space: This recommendation is probably based on the amount of space you'll need to store the software, and not on the space you'll need for logfiles, long-term trend data, etc. But again, disk space is cheap these days, and skimping is counterproductive.
Let's think a bit more about how long-term data collection affects your disk requirements. First, you should recognize that some products have only minimal data-collection facilities, while others exist purely for the purpose of collecting data (for example, MRTG). Whether you can do data collection effectively depends to some extent on the NMS product you've selected. Therefore, before deciding on a software product, you should think about your data-collection requirements. Do you want to do long-term trend analysis? If so, that will affect both the software you choose and the hardware on which you run it.
For a starting point, let's say that you have 1,000 nodes, you want to collect data every minute, and you're collecting 1 KB of data per node. That's 1 MB per minute, 1.4 GB per day—you'll fill a 40GB disk in about a month. That's bordering on extravagant. But let's look at the assumptions:
Collecting data every minute is certainly excessive; every 10 minutes should do. Now your 40GB disk will store about nine months' worth of data.
A network with 1,000 nodes isn't that big. But do you really want to store trend data for all your users' PCs? Much of this book is devoted to showing you how to control the amount of data you collect. Instead of 1,000 nodes, let's first count interfaces. And let's forget about desktop systems—we really care about trend data for our network backbone: key servers, routers, switches, etc. Even on a midsize network, we're probably talking about 100 or 200 interfaces.
The amount of data you collect per interface depends on many factors, not the least of which is the format of the data. An interface's status may be up or down—that's a single bit. If it's being stored in a binary data structure, it may be represented by a single bit. But if you're using syslog to store your log data and writing Perl scripts to do trend analysis, your syslog records are going to be 80 bytes or so, even if you are storing only 1 bit of information. Data-storage mechanisms range from syslog to fancy database schemes—you obviously need to understand what you're using, and how it will affect your storage requirements. Furthermore, you need to understand how much information you really want to keep per interface. If you want to track only the number of octets going in and out of each interface and you're storing this data efficiently, your 40GB disk could easily last the better part of a century.
Seriously, it's hard to estimate your storage requirements when they vary over two or three orders of magnitude. But the lesson is that no vendor can tell you what your storage requirements will be. A gigabyte should be plenty for log data on a moderately large network, if you're storing data only for a reasonable subset of that network, not polling too often, and not saving too much data. But that's a lot of variables, and you're the only one in control of them. Keep in mind, though, that the more data you collect, the more time and CPU power will be required to grind through all that data and produce meaningful results. It doesn't matter whether you're using expensive trend-analysis software or some homegrown scripts—processing lots of data is expensive. At least in terms of long-term data collection, it's probably better to err by keeping too little data around than by keeping too much.
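The arithmetic behind these estimates is easy to reproduce. The following is a minimal sketch, not a vendor formula: the function names are made up for illustration, it assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes), and it ignores filesystem and database overhead. The scenario figures (1,000 nodes, 1 KB per sample, 200 interfaces, 80-byte syslog records, a 40GB disk) come from the discussion above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GB = 10**9                      # decimal gigabyte, an assumption of this sketch
DISK = 40 * GB                  # the 40GB disk used in the text


def bytes_per_day(sources: int, bytes_per_sample: int, interval_s: int) -> float:
    """Bytes written per day when every source is sampled once per interval."""
    samples_per_day = SECONDS_PER_DAY / interval_s
    return sources * bytes_per_sample * samples_per_day


def days_to_fill(disk_bytes: float, sources: int,
                 bytes_per_sample: int, interval_s: int) -> float:
    """Days until the disk is full, ignoring any storage overhead."""
    return disk_bytes / bytes_per_day(sources, bytes_per_sample, interval_s)


# (sources, bytes per sample, polling interval in seconds)
scenarios = {
    "1,000 nodes, 1 KB, every minute": (1000, 1000, 60),
    "1,000 nodes, 1 KB, every 10 minutes": (1000, 1000, 600),
    "200 interfaces, 80-byte syslog record, every 10 minutes": (200, 80, 600),
}

for label, (sources, size, interval) in scenarios.items():
    days = days_to_fill(DISK, sources, size, interval)
    print(f"{label}: {days:,.0f} days")
```

Changing any one of the three inputs by a factor of 10 changes the answer by the same factor, which is why the estimates swing over several orders of magnitude.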
Before going out and buying all your equipment, it's worth spending some time coming up with an architecture for your network that will make it more manageable. The simplest architecture has a single management station that is responsible for the entire network, as shown in Figure 4-1.
The network depicted in Figure 4-1 has three sites: New York, Atlanta, and San Jose. The NMS in New York is responsible for managing not only the portion of the network in New York, but also those in Atlanta and San Jose. Traps sent from any device in Atlanta or San Jose must travel over the Internet to get to the NMS in New York. The same thing goes for polling devices in San Jose and Atlanta: the NMS in New York must send its requests over the Internet to reach these remote sites. For small networks, an architecture like this can work well. However, when the network grows to the point that a single NMS can no longer manage everything, this architecture becomes a real problem. The NMS in New York can get behind in its polling of the remote sites, mainly because it has so much to manage. The result is that when problems arise at a remote site, they may not get noticed for some time. In the worst case, they might not get noticed at all.
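To get a rough feel for when a single NMS starts to fall behind, you can compare the time one full polling pass takes with the polling interval you want. The sketch below is a back-of-the-envelope model only: it assumes every request waits one full round trip, that a fixed number of requests can be outstanding at once, and that nothing times out or is retried. The device counts, round-trip times, and function names are hypothetical.

```python
import math


def poll_cycle_seconds(devices: int, requests_per_device: int,
                       rtt_s: float, concurrency: int = 1) -> float:
    """Time for one full polling pass when each request waits one round trip."""
    total_requests = devices * requests_per_device
    waves = math.ceil(total_requests / concurrency)
    return waves * rtt_s


def falls_behind(cycle_s: float, interval_s: float) -> bool:
    """True when a polling pass takes longer than the desired interval."""
    return cycle_s > interval_s


INTERVAL = 300  # desired polling interval in seconds (assumed)

# Hypothetical figures: 5 ms round trip on the local LAN, 80 ms to a remote site.
local_cycle = poll_cycle_seconds(devices=1000, requests_per_device=5, rtt_s=0.005)
remote_cycle = poll_cycle_seconds(devices=1000, requests_per_device=5, rtt_s=0.08)

print("local pass falls behind:", falls_behind(local_cycle, INTERVAL))
print("remote pass falls behind:", falls_behind(remote_cycle, INTERVAL))
```

Timeouts and retries for unreachable devices, which this model leaves out, only lengthen the pass, so real numbers tend to be worse for remote sites.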
It's also worth thinking about staffing. With a single NMS, your primary operations staff would be in New York, watching the health of the network. But problems frequently require somebody on-site to intervene. This requires someone in Atlanta and San Jose, plus the coordination that entails. You may not need a full-time network administrator, but you will need someone who knows what to do when a router fails.
When your network grows to a point where one NMS can no longer manage everything, it's time to move to a distributed NMS architecture. The idea behind this architecture is simple: use two or more management stations and locate them as close as possible to the nodes they are managing. In the case of our three-site network, we would have an NMS at each site. Figure 4-2 shows the addition of two NMSs to the network.
This architecture has several advantages, not the least of which is flexibility. With the new architecture, the NMSs in Atlanta and San Jose can act as standalone management stations, each with a fully self-sufficient staff, or they can forward events to the NMS in New York. If the remote NMSs forward all events to the NMS in New York, there is no need to put additional operations staff in Atlanta and San Jose. At first glance, this looks like we've returned to the situation of Figure 4-1, but that isn't quite true. Most NMS products provide some kind of client interface for viewing the events currently in the NMS (traps received, responses to polls, etc.). Since the NMS that forwards events to New York has already discovered the problem, we're simply letting the NMS in New York know about it so that it can be dealt with appropriately. The New York NMS didn't have to use valuable resources to poll the remote network to discover that there was a problem.
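The division of labor described here can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular NMS product implements forwarding: the class names, the health-check callables, and the device names are all invented for illustration. Each site NMS polls only its own devices and, if a forwarding target is configured, passes along the events it discovers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    site: str
    device: str
    message: str


@dataclass
class CentralNMS:
    """Receives events; it never polls the remote devices itself."""
    events: List[Event] = field(default_factory=list)

    def receive(self, event: Event) -> None:
        self.events.append(event)


@dataclass
class SiteNMS:
    site: str
    # device name -> health check that returns True when the device is healthy
    devices: Dict[str, Callable[[], bool]]
    # None means standalone mode: events stay at the site
    forward: Optional[Callable[[Event], None]] = None
    local_events: List[Event] = field(default_factory=list)

    def poll_once(self) -> None:
        for name, is_healthy in self.devices.items():
            if not is_healthy():
                event = Event(self.site, name, "poll failed")
                self.local_events.append(event)
                if self.forward is not None:
                    self.forward(event)


new_york = CentralNMS()
atlanta = SiteNMS(
    "Atlanta",
    {"atl-router": lambda: False, "atl-switch": lambda: True},
    forward=new_york.receive,
)
atlanta.poll_once()
print(new_york.events)
```

Leaving `forward` unset gives the standalone mode: the site keeps its own event list and nothing depends on the link to New York.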
The other advantage is that, if the need arises, you can put operations staff in Atlanta and San Jose to manage each of these remote locations. If New York loses connectivity to the Internet, events forwarded from Atlanta or San Jose will not make it to New York. With operations staff in Atlanta and San Jose, and the NMSs at these locations acting in standalone mode, a network outage in New York won't matter. The remote-location staff will continue as if nothing has happened.
Another possibility with this architecture is a hybrid mode: you staff the operations center in New York 24 hours a day, 7 days a week, but you staff Atlanta and San Jose only during business hours. During off-hours, they rely on the NMS and operations staff in New York to notice and handle problems that arise. But during the critical (and busiest) hours of the day, Atlanta and San Jose don't have to burden the New York operators.
Both of the architectures we have discussed use the Internet to send and receive management traffic. This poses several problems, mainly dealing with security and overall reliability. A better solution is to use private links to perform all your network management functions. Figure 4-3 shows how the distributed NMS architecture can be extended to make use of such links.
Let's say that New York's router is the core router for the network. We establish private (but not necessarily high-speed) links between San Jose and New York, and between New York and Atlanta. This means that San Jose will not only be able to reach New York, but it will also be able to reach Atlanta via New York. Atlanta will use New York to reach San Jose, too. The private links (denoted by thicker router-to-router connections) are primarily devoted to management traffic, though we could put them to other uses. Using private links has the added benefit that our community strings are never sent out over the Internet. The use of private network links for network management works equally well with the single NMS architecture, too. Of course, if your corporate network consists entirely of private links and your Internet connections are devoted to external traffic only, using private links for your management traffic is the proverbial "no-brainer."
One final item worth mentioning is the notion of trap-directed polling. This doesn't really have anything to do with NMS architecture, but it can help to alleviate an NMS's management strain. The idea behind trap-directed polling is simple: the NMS receives a trap and initiates a poll to the device that generated the trap. The goal of this scenario is to determine whether there is indeed a problem with the device while allowing the NMS to ignore (or devote few resources to) the device in normal operation. If an organization relies on this form of management, it should implement it in such a way that non-trap-directed polling is almost done away with. That is, it should avoid polling devices at regular intervals for status information. Instead, the management stations should simply wait to receive a trap before polling a device. This form of management can significantly reduce the resources needed by an NMS to manage a network. However, it has an important disadvantage: traps can get lost in the network and never make it to the NMS. This is a reality of the connectionless nature of UDP and the imperfect nature of networks.
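The control flow of trap-directed polling can be sketched without any real SNMP traffic. In the sketch below, the class name, the `poll` callable, and the device names are assumptions made for illustration; a real NMS would receive traps over UDP and issue actual SNMP requests where this code calls a stand-in function.

```python
from typing import Callable, Dict


class TrapDirectedPoller:
    """Polls a device only after a trap from that device arrives."""

    def __init__(self, poll: Callable[[str], bool]) -> None:
        self._poll = poll                  # returns True when the device is healthy
        self.status: Dict[str, str] = {}   # last known status per device
        self.polls_sent = 0

    def on_trap(self, device: str) -> str:
        """Handle a trap from `device` by polling it once to confirm its state."""
        self.polls_sent += 1
        healthy = self._poll(device)
        self.status[device] = "ok" if healthy else "problem"
        return self.status[device]


def fake_poll(device: str) -> bool:
    # Stand-in for a real SNMP get; pretends only "router-2" is unhealthy.
    return device != "router-2"


poller = TrapDirectedPoller(fake_poll)
print(poller.on_trap("router-2"))
print(poller.on_trap("router-1"))
print(poller.polls_sent)
```

Devices that never send a trap are never polled here, which is where the resource savings come from. Because a lost trap would go unnoticed in this scheme, one option is to keep a very infrequent background poll as a safety net.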
Web-based network management entails the use of the HyperText Transfer Protocol (HTTP) and the Common Gateway Interface (CGI) to manage networked entities. It works by embedding a web server in an SNMP-compatible device, along with a CGI engine to convert SNMP-like requests (from a web-based NMS) to actual SNMP operations, and vice versa. Web servers can be embedded into such devices at very low monetary and operating cost.
Figure 4-4 is a simplified diagram of the interaction between a web-based NMS and a managed device. The CGI application bridges the gap between the management application and the SNMP engine. In some cases, the management application can be a collection of Java applets that are downloaded to the web browser and executed on the web-based manager. Current versions of OpenView ship with a web-based GUI. SNMPc also has web-based capabilities: a Java client for the network management console and the recently released SNMPc Online, which is a web-based reporting frontend.
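The bridging role of the CGI application can be illustrated with a small request handler. This is a sketch under assumptions: the `oid` query parameter, the `handle_request` and `snmp_get` names, and the in-memory table standing in for the device's SNMP engine are all hypothetical. The OIDs shown are the standard sysDescr.0 and sysName.0 objects.

```python
import html
from typing import Optional, Tuple
from urllib.parse import parse_qs

# Stand-in for the device's SNMP engine (values are made up).
FAKE_AGENT = {
    "1.3.6.1.2.1.1.1.0": "Example router",   # sysDescr.0
    "1.3.6.1.2.1.1.5.0": "router-1",         # sysName.0
}


def snmp_get(oid: str) -> Optional[str]:
    """Placeholder for a real SNMP get operation against the local agent."""
    return FAKE_AGENT.get(oid)


def handle_request(query_string: str) -> Tuple[int, str]:
    """Translate a web request such as 'oid=1.3.6.1.2.1.1.5.0' into an SNMP get."""
    params = parse_qs(query_string)
    oids = params.get("oid", [])
    if not oids:
        return 400, "<p>missing oid parameter</p>"
    oid = oids[0]
    value = snmp_get(oid)
    if value is None:
        return 404, f"<p>no such object: {html.escape(oid)}</p>"
    return 200, f"<p>{html.escape(oid)} = {html.escape(value)}</p>"


print(handle_request("oid=1.3.6.1.2.1.1.5.0"))
```

The handler returns a status code and an HTML fragment; an embedded web server would wrap that in an HTTP response for the browser-based manager.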
Web-based network management could eliminate, or at least reduce, the need for traditional NMS software. NMS software can be expensive to purchase, set up, and maintain. Most of today's major NMS vendors support only a few popular versions of Unix and have only recently begun to support Windows, thus limiting your operating-system choices. With a web-based NMS, however, these two concerns are moot. For the most part, web browsers are free, and Unix, Windows, and Apple platforms all run the popular browsers.
Web-based network management should not be viewed as a panacea, though. It is a good idea, but it will take some time for vendors to embrace this technology and move toward web integration of their existing products. There is also the issue of standardization, or the lack of it. The Web-Based Enterprise Management (WBEM) Initiative addresses this by defining a standard for web-based management. Industry leaders such as Cisco and BMC Software are among the original founders of WBEM. You can learn more about this initiative at the Distributed Management Task Force's web page, http://www.dmtf.org/standards/wbem.
Another important standard in this area is XML (eXtensible Markup Language). XML is a markup language used for the interchange of structured data. XML makes use of DTDs (Document Type Definitions) or schemas to specify a document's structure and, in the case of schemas, to validate data. A DTD or schema is similar to an SNMP MIB. XML may be used for network management purposes in the following scenarios:
In environments where UDP traffic isn't permissible, XML can be used as an intermediary application-level protocol. Of course, this requires a mapping layer to translate from XML to SNMP and vice versa.
Representing MIB information in XML has the distinct advantage of allowing languages and systems that support XML parsing to access that information. Java is a language that can easily interact with XML.
While this may seem like a perversion of what XML was originally intended for, applications are being written that use XML as an application-level protocol for not only exchanging messages, but also sending control messages.
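The mapping layer mentioned in the first scenario can be sketched with Python's standard XML parser. The XML vocabulary used here (`snmp-request`, `snmp-response`, `oid`, `varbind`, and the `operation` attribute) is invented for illustration and does not follow any published schema; the SNMP side is left out entirely.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def parse_request(xml_text: str) -> Tuple[str, List[str]]:
    """Extract the operation name and requested OIDs from an XML request."""
    root = ET.fromstring(xml_text)
    operation = root.get("operation", "")
    oids = [el.text.strip() for el in root.findall("oid") if el.text]
    return operation, oids


def build_response(values: Dict[str, str]) -> str:
    """Wrap OID/value pairs (as returned by an SNMP operation) in XML."""
    root = ET.Element("snmp-response")
    for oid, value in values.items():
        varbind = ET.SubElement(root, "varbind", {"oid": oid})
        varbind.text = str(value)
    return ET.tostring(root, encoding="unicode")


request = (
    '<snmp-request operation="get">'
    "<oid>1.3.6.1.2.1.1.5.0</oid>"
    "</snmp-request>"
)
operation, oids = parse_request(request)
print(operation, oids)

# In a real mapping layer, the values would come from an SNMP get on each OID.
response = build_response({"1.3.6.1.2.1.1.5.0": "router-1"})
print(response)
```

Because both directions are plain text carried over whatever transport is allowed, a layer like this can sit in front of an SNMP engine in environments where UDP is blocked.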
As new technology comes to the forefront, SNMP researchers, vendors, and users will embrace it whenever it makes sense. This is evidenced by the adoption of SNMPv3 as well as by the use of web technologies for tackling the problems presented by the ever-expanding scope of network management.