BGP Building Reliable Networks with the Border Gateway Protocol

By Iljitsch van Beijnum

Table of Contents

Chapter 1: The Internet, Routing, and BGP
One of the many remarkable qualities of the Internet is that it has scaled so well to its current size. This doesn't mean that nothing has changed since the early days of the ARPANET in 1969. The opposite is true: our current TCP and IP protocols weren't constructed until the late 1970s. Since that time, TCP/IP has become the predominant networking protocol for just about every kind of digital communication.
The story goes that the Internet—or rather the ARPANET, which is regarded as the origin of today's Internet—was invented by the military as a network that could withstand a nuclear attack. That isn't how it actually happened. In the early 1960s, Paul Baran, a researcher for the RAND Corporation, wrote a number of memoranda proposing a digital communications network for military use that could still function after sustaining heavy damage from an enemy attack. Using simulations, Baran proved that a network with only three or four times as many connections as the minimum required to operate comes close to the theoretical maximum possible robustness. This of course implies that the network adapts when connections fail, something the telephone network and the simple digital connections of that time couldn't do, because every connection was manually configured. Baran incorporated numerous revolutionary concepts into his proposed network: packet switching, adaptive routing, the use of digital circuits to carry voice communication, and encryption inside the network. Many people believed such a network couldn't work, and it was never built.
Several years later, the Department of Defense's Advanced Research Projects Agency (ARPA) grew dissatisfied with the fact that many universities and other research institutions that worked on ARPA projects were unable to easily exchange results on computer-related work. Because computers from the many different vendors used different operating systems and languages, and because they were usually customized to some extent by their users, it was extremely hard to make a program developed on one computer run on another machine. ARPA wanted a network that would enable researchers to access computers located at different research institutions throughout the United States.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Topology of the Internet
Because it's a "network of networks," there was always a need to interconnect the different networks that together form the global Internet. In the beginning, everyone simply connected to the ARPANET, but over the years, the topology of the Internet has changed radically.
During the late 1980s, the ARPANET was replaced as the major "backbone" of the Internet by a new National Science Foundation-sponsored network between five supercomputer locations: the NSFNET Backbone. Federal Internet Exchanges on the East and West Coasts (FIX East and FIX West) were built in 1989 to aid in the transition from the ARPANET to the NSFNET Backbone. Originally, the FIXes were 10-Mbps Ethernets, but 100-Mbps FDDI was added later to increase bandwidth. The Commercial Internet Exchange (CIX, "kicks") on the West Coast came into existence because the people in charge of the FIXes were hesitant to connect commercial networks. CIX operated a CIX router and several FDDI rings for some time, but it abandoned those activities and turned into a trade association in the late 1990s. In 1992, Metropolitan Fiber Systems (MFS, now Worldcom) built a Metropolitan Area Ethernet (MAE) in the Washington, DC, area, which quickly became a place where many different (commercial) networks interconnected. Interconnecting at an Internet Exchange (IX) or MAE is attractive, because many networks connect to the IX or MAE infrastructure, so all that's needed is a single physical connection to interconnect with many other networks.
Before the early 1990s, the Internet was almost exclusively used as a research network. Some businesses were connected, but this was limited to their research divisions. All this changed when email became more pervasive outside the research community, and the World Wide Web made the network much more visible. More and more business and nonresearch organizations connected to the network, and the additional traffic became a burden for the NSFNET Backbone. Also, the NSFNET Backbone Acceptable Use Policy didn't allow "for-profit activities." In 1995, the NSFNET Backbone was decommissioned, giving room to large ISPs to compete with each other by operating their own backbone networks. To ensure connectivity between the different networks, four contracts for Network Access Points (NAPs) were awarded by the NSF, each run by a different telecommunication company:
Additional content appearing in this section has been removed.
TCP/IP Design Philosophy
The fact that TCP/IP runs well over all kinds of underlying networks is no coincidence. Today, every imaginable kind of computer is connected to the Net, even though those connected over the fastest links, such as Gigabit Ethernet, can transfer more data in a second than the slowest, connected through wireless modems, can transfer in a day. This flexibility is the result of the philosophy that network failures shouldn't impede communication between two hosts and that no assumptions should be made about the underlying communications channels. Any kind of circuit that can carry packets from one place to another with some reasonable degree of reliability may be used.
This philosophy makes it necessary to move all the decision-making to the source and destination hosts: it would be very hard to survive the loss of a router somewhere along the way if this router holds important, necessary information about the connection. This way of doing things is very different from the way telephony and virtual circuit-oriented networks such as X.25 work: they go through a setup phase, in which a path is configured at central offices or telephone switches along the way before any communication takes place. The problem with this approach is that when a switch fails, all paths that use this switch fail, disrupting ongoing communication. In a network built on an unreliable datagram service, such as the Internet, packets can simply be diverted around the failure and still be delivered. The price to be paid for this flexibility is that end hosts have to do more work. Packets that were on their way over the broken circuit may be lost; some packets may be diverted in the wrong direction at first, so that they arrive after subsequent packets have already been received; or the new route may be of a different speed or capacity. The networking software in the end hosts must be able to handle any and all of these eventualities.
Because the TCP protocol takes care of the most complex tasks, IP processing along the way becomes extremely simple: basically, just take the destination address, look it up in the routing table to find the next-hop address and/or interface, and send the packet on its way to this next hop over the appropriate interface. This isn't immediately obvious from looking at the IP header, because there are 12 fields in it, which seems like a lot at first glance. The function of each field, except perhaps the Type of Service and fragmentation-related fields, is simple enough, however.
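The lookup step described above can be sketched in a few lines of Python. This is only an illustration: the routing table entries, next-hop addresses, and interface names are made up, and the sketch assumes the most specific matching prefix wins (longest-prefix match), which the text above doesn't spell out.

```python
import ipaddress

# Hypothetical routing table: (destination prefix, next hop, interface).
# All values are illustrative and not taken from the book.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.0.2.1", "eth0"),
    (ipaddress.ip_network("10.0.0.0/8"), "192.0.2.2", "eth1"),
    (ipaddress.ip_network("10.1.0.0/16"), "192.0.2.3", "eth2"),
]


def lookup(destination):
    """Return (next_hop, interface) for the most specific matching route."""
    address = ipaddress.ip_address(destination)
    matches = [route for route in ROUTES if address in route[0]]
    if not matches:
        return None  # no route: the packet would be dropped
    # Assumption: the route with the longest prefix is preferred.
    prefix, next_hop, interface = max(matches, key=lambda r: r[0].prefixlen)
    return next_hop, interface
```

A call such as `lookup("10.1.2.3")` matches all three routes and selects the /16 entry, while an address outside 10.0.0.0/8 falls through to the default route.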
Routing Protocols
This leaves just one problem unsolved: how do we maintain an up-to-date routing table? Simply entering the necessary information manually isn't good enough: the routing table has to reflect the actual way in which everything is connected at any given time, the network topology. This means using dynamic routing protocols so that topology changes, such as cable cuts and failed routers, are communicated promptly throughout the network.
A simple routing protocol is the Routing Information Protocol (RIP). RIP periodically broadcasts the contents of the routing table over every connection and listens for other routers to do the same. Routes received through RIP are added to the routing table and, from then on, are broadcast along with the rest of the routing table. Every route contains a "hop count" that indicates the distance to the destination network, so routers have a way to select the best path when they receive multiple routes to the same destination. RIP is considered a distance-vector routing protocol, because it stores only information about where to send packets for a certain destination and how many hops are necessary to get there.
Open Shortest Path First (OSPF) is a much more advanced routing protocol, so much so that it was once questioned whether Dijkstra's Shortest Path First (SPF) algorithm, on which the protocol is based, would be too complex for routers to run. This turned out not to be a problem as long as some restrictions are taken into account when designing OSPF networks. Instead of broadcasting all routes periodically, OSPF keeps a topology map of the network and sends updates to the other routers throughout the network only when something changes. All routers then recompute their best paths from the updated topology map using the SPF algorithm. This makes OSPF a link-state protocol. Rather than just counting hops, OSPF takes into account the cost of every link, which usually reflects the link bandwidth, when computing the best path to a destination.
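To make the link-state idea concrete, here is a minimal sketch of Dijkstra's algorithm run over a small topology map. The router names and link costs are invented for illustration, and the sketch leaves out everything else OSPF does (flooding updates, areas, timers).

```python
import heapq


def shortest_path_costs(topology, source):
    """Return the lowest total cost from source to every reachable router.

    topology maps each router name to a dict of {neighbor: link cost}.
    """
    costs = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, router = heapq.heappop(queue)
        if cost > costs.get(router, float("inf")):
            continue  # stale queue entry; a cheaper path was already found
        for neighbor, link_cost in topology.get(router, {}).items():
            new_cost = cost + link_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return costs


# Hypothetical topology: the direct A-C link is slow (cost 10),
# while the two-hop path through B costs only 2.
TOPOLOGY = {
    "A": {"B": 1, "C": 10},
    "B": {"A": 1, "C": 1},
    "C": {"A": 10, "B": 1},
}
```

With this map, `shortest_path_costs(TOPOLOGY, "A")` reaches C at cost 2 through B, whereas a pure hop count would prefer the direct but slower link.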
Obviously, periodically broadcasting all the routes or keeping topology information about every single connection isn't possible for the entire Internet. Thus, in addition to
Additional content appearing in this section has been removed.
Multihoming
Having connections to two or more ISPs and running BGP means cooperating in worldwide interdomain routing. This is the only way to make sure your IP address range is still reachable when your connection to an ISP fails or when the ISP itself fails. Compared to just connecting to a single ISP, multihoming is like driving your own car rather than taking the bus. In the bus, someone else does the driving, and you're just along for the ride. Under most circumstances, driving your own car isn't very difficult, and the extra speed and flexibility are well worth it. However, you need to stay informed about issues such as traffic congestion, and you need to maintain the car yourself.
There are some important disadvantages to using BGP. A pessimist might say that you gain a lot of complexity to lose a lot of stability. Implementing BGP shouldn't be taken lightly. Even if you do everything right, there will be times when you are unreachable because of BGP problems, when your network would have been reachable if you hadn't used BGP. There is a lot you can do to keep the number of these incidents and the time to repair to a minimum, however. On the other hand, if you don't run BGP, and your ISP has a problem in their network or the connection to them fails, there is usually very little you can do, and the downtime can be considerable. So in most cases, BGP will increase your uptime, but only if you carefully correct potential problems before they interfere with proper operation of the network.
Chapter 2: IP Addressing and the BGP Protocol
This chapter provides an overview of the IP address architecture and some interdomain routing history, followed by an explanation of the BGP protocol, information on how BGP relates to routing in general, and a discussion of Multiprotocol BGP.
IP Addresses
IP addresses are made up of two parts: the network part and the host part. Because IP addresses are only 32 bits in length, it's not possible to have both a large host part (to accommodate networks with many hosts) and a large network part (to accommodate a large number of networks) at the same time. To get around this, there are three classes of IP addresses:
  • Class A addresses, with a 7-bit network part and a 24-bit host part, allow 128 networks with 16 million hosts each. The highest bit is always set to 0 in Class A addresses, so the first byte of Class A IP addresses ranges from 0 to 127.
  • Class B addresses, with a 14-bit network part and a 16-bit host part, allow 16,384 networks with 65,534 hosts each. The two highest bits are always set to 10 in Class B addresses, so the first byte of Class B IP addresses ranges from 128 to 191.
  • Class C addresses, with a 21-bit network part and an 8-bit host part, allow 2 million networks with 254 hosts each. The three highest bits are always set to 110 in Class C addresses, so the first byte of Class C IP addresses ranges from 192 to 223.
Note that the first address in a network (the all-zeros address) is the network address and can't be used. The last address (with all the bits in the host part set to one) is the network broadcast address and can't be used either. Addresses with a first byte in the 224–239 range are multicast (Class D) addresses, and those in the range 240–255 are reserved for future use.
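The first-byte ranges listed above translate directly into a small classification function. The function name and return values below are illustrative choices, not part of any standard API.

```python
def address_class(address):
    """Return the classful category of a dotted-quad IPv4 address."""
    first_byte = int(address.split(".")[0])
    if not 0 <= first_byte <= 255:
        raise ValueError(f"invalid first byte in {address!r}")
    if first_byte <= 127:
        return "A"  # highest bit 0
    if first_byte <= 191:
        return "B"  # highest bits 10
    if first_byte <= 223:
        return "C"  # highest bits 110
    if first_byte <= 239:
        return "D"  # multicast
    return "reserved"  # 240-255
```

For example, `address_class("172.16.0.1")` falls in the 128 to 191 range and is reported as Class B.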
The network/host structure in IP assumes each network has only a single lower-layer network, such as an Ethernet. Using switches, it's of course possible to build an organization-wide Ethernet, but in practice, most networks consist of several subnetworks. To deal with this, IP has the notion of a subnet mask. The subnet mask determines how many bits in the address are really used to number hosts and how many are used to number the different subnets within the network. For instance, an organization with a Class B network may use a subnet mask of 255.255.255.0, so that there are eight bits available to number hosts (for a maximum of 254 hosts per subnet) and eight bits to number the subnets.
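The arithmetic in this example can be checked with Python's standard `ipaddress` module. The network 172.16.0.0 is just an arbitrary address in the Class B range chosen for illustration.

```python
import ipaddress

# Hypothetical Class B network (16-bit network part).
network = ipaddress.ip_network("172.16.0.0/16")

# A 255.255.255.0 mask corresponds to a 24-bit prefix:
# eight bits number the subnets, eight bits number the hosts.
subnets = list(network.subnets(new_prefix=24))
first_subnet = subnets[0]

# The network address and the broadcast address can't be assigned to hosts.
usable_hosts = first_subnet.num_addresses - 2
```

Here `len(subnets)` gives the number of subnets the eight subnet bits allow, and `usable_hosts` gives the per-subnet host limit mentioned in the text.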
Interdomain Routing History
During the rule of the ARPANET, the original routing protocol between the Interface Message Processors (IMPs) evolved into the Gateway-to-Gateway Protocol (GGP, RFC 823). This is a distance-vector protocol like RIP, but unlike RIP, it uses a reliable transport mechanism, and routing updates are sent only when there is a change in reachability status for some part of the network.
In 1984, the Exterior Gateway Protocol became formalized in RFC 904. As a routing protocol, EGP isn't very advanced: it doesn't support topologies with loops in them, for instance. The main intended purpose for the protocol was to connect "stub gateways" (routers connecting to a nontransit network) to the rest of the Net and have those stub gateways announce reachability information for their AS. EGP needs the network to have a tree structure, in which information flows either up, in the direction of the core or backbone, or down, in the direction of stub networks. New in EGP was the notion of different routing domains: interior within an autonomous system and exterior between ASes. Within the ARPANET, GGP remained in use as the interior protocol.
In 1989, the new Border Gateway Protocol no longer let routers find neighbors on their own; it required them to be configured manually and ran over TCP. BGP Version 1 (RFC 1105) still had the notion of up, down, or horizontal relationships, as in EGP. This limitation was abandoned in BGP-2 (RFC 1163), along with major changes to the message formats. BGP-3 (RFC 1267) introduced, among other things, the BGP identifier field in the open message and defined how to use this field to decide which connection is terminated when two BGP neighbors each initiate a TCP session at the same time (a connection collision). In 1994, BGP-4 (RFC 1654, later RFC 1771) added CIDR, aggregation support, the Local Preference attribute, and a per-connection hold time.
While BGP was still in its infancy, work was being done on an even more groundbreaking approach to interdomain routing: the Inter-Domain Policy Routing (IDPR) protocol (RFC 1479). IDPR tries to look at the policies of a source and destination network and the networks in between and attempts to accommodate user requests for certain services and QoS guarantees. Unlike BGP, IDPR uses a link-state mechanism for distributing routing information. This makes it possible for the source to apply its policies more accurately. But it doesn't stop there: the protocol breaks the fundamental hop-by-hop forwarding paradigm of IP. To do this, all traffic is tunneled. Tunneling hides the network layer: in essence, the source gets to decide how routers further upstream have to route the packet. With IDPR, it's no problem for an ISP to send traffic from one customer over one transit connection and traffic from another customer over another transit connection even if the destination is the same in both cases. An ISP may want to do this if one transit ISP offers a much better service but is also more expensive. One customer may need the better service level, while the other doesn't want to pay too much. With BGP this isn't possible, because hop-by-hop forwarding takes only the destination address into account, and traffic flows that have come together at some point can't be separated later. (At least, they can't be separated without employing special techniques such as policy routing.)
The BGP Protocol
BGP uses TCP on port 179 for communication between neighbors. This is unusual: all other routing protocols either run directly on top of IP or use UDP, which makes it possible to send broadcasts or multicasts to discover neighboring routers. BGP doesn't require this neighbor-discovery functionality, however, and running over TCP avoids having to incorporate a significant amount of transport protocol functionality, such as fragmentation, sequencing, and retransmission of data.
BGP Versions 1, 2, and 3 should be considered completely obsolete. Whenever "BGP" is used, it means BGP-4.
When BGP neighbors establish a TCP session, they start exchanging BGP information in the form of "messages." Each message starts with a header, followed by the contents of the message, as shown in the following table.
Table: BGP message header format
Marker     Length    Type     Message contents
16 bytes   2 bytes   1 byte   0-4077 bytes
The marker usually contains all 1s and is used to check whether the sender and receiver are still synchronized. If the receiver finds an unexpected value in the marker field, something must have gone wrong, so the receiver sends back an error indication and closes the connection. The length field holds the length of the BGP message, which has a minimum length of 19 bytes (just a header with no message) and a maximum of 4,096 bytes. The type indicates the message's purpose: open (1), update (2), notification (3), or keepalive (4) (as defined in RFC 1771, with more message types defined in later RFCs).
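As an illustration of the header layout, the following sketch builds and parses the 19-byte header with Python's `struct` module. It assumes the marker is always all 1s (the usual case described above), and the function names are made up for this example.

```python
import struct

# Marker (16 bytes), length (2 bytes), type (1 byte), in network byte order.
HEADER = struct.Struct("!16sHB")
ALL_ONES_MARKER = b"\xff" * 16
MESSAGE_TYPES = {1: "open", 2: "update", 3: "notification", 4: "keepalive"}
MAX_MESSAGE_LENGTH = 4096


def build_message(message_type, contents=b""):
    """Prepend a BGP header to the given message contents."""
    length = HEADER.size + len(contents)
    if length > MAX_MESSAGE_LENGTH:
        raise ValueError("message exceeds 4,096 bytes")
    return HEADER.pack(ALL_ONES_MARKER, length, message_type) + contents


def parse_header(data):
    """Return (length, type name) after checking the marker and length."""
    marker, length, message_type = HEADER.unpack_from(data)
    if marker != ALL_ONES_MARKER:
        # Assumption: no authentication in use, so anything else is an error.
        raise ValueError("unexpected marker: sender and receiver out of sync")
    if not HEADER.size <= length <= MAX_MESSAGE_LENGTH:
        raise ValueError(f"invalid message length {length}")
    return length, MESSAGE_TYPES.get(message_type, "unknown")
```

A keepalive is the smallest possible message: `build_message(4)` yields just the 19-byte header with no contents.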
Both sides send an open message immediately after the TCP session has been established. The open message conveys important information about the BGP speaker's configuration and abilities. The format of the open message is shown in the following table.
Table: BGP open message format
Version   My AS     Hold time   Identifier   Par len   Optional parameters
1 byte    2 bytes   2 bytes     4 bytes      1 byte    0-255 bytes
The first field indicates the BGP version, which would normally be 4. The next field is the sender's AS number. The hold time is the maximum number of seconds the session may remain idle before it's torn down because of a timeout. The lower of the hold times in both open messages is used. The minimum hold time is three seconds; the value zero means the session will never time out. The identifier field contains one of the BGP speaker's IP addresses. A router must use the same identifier for all BGP sessions. The optional parameter length field ("par len") indicates the absence (with a zero value) or length of an optional parameters field. If there are any optional parameters, each is preceded by a one-byte parameter type and a one-byte parameter length. The optional parameters field negotiates the use of authentication and extended capabilities, such as multiprotocol extensions and route refresh.
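The open message body can be sketched the same way. The AS number, identifier, and hold times used here are arbitrary example values, the function names are hypothetical, and the optional parameters are treated as an opaque byte string rather than being decoded.

```python
import ipaddress
import struct

# Version (1), my AS (2), hold time (2), identifier (4), par len (1).
OPEN_FIXED = struct.Struct("!BHH4sB")
BGP_VERSION = 4


def build_open(my_as, hold_time, identifier, optional_parameters=b""):
    """Encode the body of an open message (without the message header)."""
    if hold_time in (1, 2):
        raise ValueError("hold time must be zero or at least three seconds")
    if len(optional_parameters) > 255:
        raise ValueError("optional parameters exceed 255 bytes")
    return OPEN_FIXED.pack(
        BGP_VERSION,
        my_as,
        hold_time,
        ipaddress.IPv4Address(identifier).packed,
        len(optional_parameters),
    ) + optional_parameters


def parse_open(body):
    """Decode the fixed fields and return them as a dictionary."""
    version, my_as, hold_time, identifier, par_len = OPEN_FIXED.unpack_from(body)
    start = OPEN_FIXED.size
    return {
        "version": version,
        "my_as": my_as,
        "hold_time": hold_time,
        "identifier": str(ipaddress.IPv4Address(identifier)),
        "optional_parameters": body[start:start + par_len],
    }


def negotiated_hold_time(local, remote):
    """The lower of the two advertised hold times is used."""
    return min(local, remote)
```

If one side advertises 180 seconds and the other 90, `negotiated_hold_time(180, 90)` selects 90 for the session.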
Multiprotocol BGP
To enable use of BGP with protocols other than just IPv4, RFC 2858 describes a way to encode reachability information for different address families in two new optional nontransitive path attributes. This makes it possible to encode routing information for different address families in BGP-4 without breaking compatibility with older implementations. (However, I can't think of any reason to send a router reachability information for a protocol it doesn't support in the first place.) The following table shows the first new attribute, MP_REACH_NLRI.
Table: The MP_REACH_NLRI attribute
AFI       SAFI     NH len   Next hop   SNPAs    SNPA       NLRI
2 bytes   1 byte   1 byte   variable   1 byte   variable   variable
The Address Family Identifier (AFI) does just what its name suggests. Note the subtle distinction between addresses and protocols. It isn't important for IP and similar protocols, but the multiprotocol extensions can also carry address families, such as E.164 (phone numbers), that aren't tied to a specific protocol. The Subsequent Address Family Identifier (SAFI) indicates the kind of addresses relative to the AFI. In IPv4 and IPv6, this is used to distinguish among routing information for unicast, multicast, or both.
The next hop field holds the network layer address for the next hop and is preceded by a length field ("NH len"). The next hop address in the format of the specified address family may not be enough information to forward packets or set up connections successfully, so there is also room for Sub-Network Point of Attachment (SNPA) information: in non-OSI terms, one or more MAC addresses for the next hop. The SNPAs field holds the number of SNPAs, and each SNPA is encoded in a somewhat perverted prefix format: rather than counting bits or bytes, the length field with the SNPA is in "semi-octet" (4-bit) units.
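A rough sketch of how the value of this attribute might be encoded for a single IPv6 unicast route follows. It makes several assumptions not stated in this excerpt: the AFI value 2 for IPv6 and SAFI value 1 for unicast come from the address family registries, no SNPAs are included, and each NLRI prefix is encoded as a one-byte length in bits followed by the minimum number of address bytes. The attribute's own flags, type, and length header is left out.

```python
import ipaddress
import struct

AFI_IPV6 = 2       # assumption: registry value for IPv6
SAFI_UNICAST = 1   # assumption: registry value for unicast forwarding


def encode_prefix(prefix):
    """Encode a prefix as a bit length followed by the significant bytes."""
    network = ipaddress.ip_network(prefix)
    byte_count = (network.prefixlen + 7) // 8
    return bytes([network.prefixlen]) + network.network_address.packed[:byte_count]


def encode_mp_reach(afi, safi, next_hop, prefixes):
    """Encode the MP_REACH_NLRI value: AFI, SAFI, NH len, next hop, SNPAs, NLRI."""
    next_hop_bytes = ipaddress.ip_address(next_hop).packed
    value = struct.pack("!HBB", afi, safi, len(next_hop_bytes)) + next_hop_bytes
    value += bytes([0])  # number of SNPAs: none in this sketch
    for prefix in prefixes:
        value += encode_prefix(prefix)
    return value


# Example addresses are taken from the IPv6 documentation range.
value = encode_mp_reach(AFI_IPV6, SAFI_UNICAST, "2001:db8::1", ["2001:db8:1::/48"])
```

The result consists of the 4 fixed bytes, a 16-byte next hop, the SNPA count, and a 7-byte prefix (one length byte plus six address bytes for the /48).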
The NLRI is in regular prefix format for the IPv4 and IPv6 unicast, multicast, and unicast plus multicast SAFIs. The second new attribute is MP_UNREACH_NLRI.
Table: The MP_UNREACH_NLRI attribute
Additional content appearing in this section has been removed.
Interior Routing Protocols
There is no requirement for networks running BGP to have an IGP as well. Simple multihomed networks with two routers run just fine without an IGP: a few static routes are all that's needed, because all traffic goes to a directly connected network, to the rest of the world, or to the other router. For larger networks, IGPs are a fact of life. Your only choice for an EGP is BGP-4, but for an IGP you can use any interior routing protocol you like. On a Cisco router, you have the following choices: RIP, IGRP, EIGRP, OSPF, and IS-IS.
The Routing Information Protocol (RIP) is a simple distance-vector routing protocol. It listens for routing updates from other routers and installs new routes into the local routing table, and it transmits the contents of the routing table every 30 seconds for the benefit of other routers. The metric is a simple hop count with a maximum of 15 hops. RIP is very old, but because it's so simple, it's still the most widely implemented routing protocol. All regular routers support it, but equipment such as terminal servers and many hosts can also do RIP.
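The update handling described here can be sketched as follows. This is a simplified illustration, not a RIP implementation: timers, split horizon, and the packet format are omitted, the router and network names are invented, and the use of 16 as the "unreachable" metric is taken from the RIP specification rather than from this text.

```python
INFINITY = 16  # one more than the 15-hop maximum; means "unreachable"


def process_update(table, neighbor, advertised):
    """Merge routes advertised by a neighbor into the local routing table.

    table maps destination -> (hop count, next hop).
    advertised maps destination -> hop count as seen by the neighbor.
    Returns True if the table changed.
    """
    changed = False
    for destination, hops in advertised.items():
        metric = min(hops + 1, INFINITY)  # one extra hop to reach the neighbor
        current = table.get(destination)
        if current is None:
            if metric < INFINITY:
                table[destination] = (metric, neighbor)
                changed = True
        elif metric < current[0] or (current[1] == neighbor and metric != current[0]):
            # Accept a shorter path, or any change from the current next hop.
            table[destination] = (metric, neighbor)
            changed = True
    return changed


# Hypothetical starting table and update from a neighbor called "r2".
table = {"10.0.0.0": (1, "direct")}
changed = process_update(
    table, "r2", {"10.0.0.0": 3, "172.16.0.0": 2, "192.168.0.0": 15}
)
```

In this example the directly connected route is kept, the route to 172.16.0.0 is installed with a metric of 3, and the 15-hop route is discarded because one more hop makes it unreachable.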
RIP is completely unsuitable for applications in which routers must quickly adapt to changes in link status, because it can take several minutes for reachability and (especially) unreachability information to converge throughout the network. RIP can still be useful, however, to convey some limited routing information (such as a default route) to RIP-aware systems that would otherwise need to depend on a static default route.
The original RIP doesn't support VLSM, but RIPv2 can handle different subnet masks within a single classful network. RIPng ("next generation") is available for use with IPv6.
The Interior Gateway Routing Protocol (IGRP) is a Cisco-proprietary distance-vector protocol. It doesn't support VLSM. Enhanced IGRP (EIGRP) is just what the name says: an enhanced version of IGRP, which does support VLSM. Cisco's goal was to create a routing protocol without the traditional distance-vector limitations, such as slow convergence, but also without the complexity of link-state protocols. They did a reasonably good job, but because EIGRP is available only on Cisco routers, it isn't widely adopted within the IP world, where open standards are appreciated.
Chapter 3: Physical Design Considerations
"The OC-3 circuit is online again. The telco reports nettles had grown into the A/C exhaust."
This chapter deals with the costly parts of the network: the physical properties, such as hardware, locations, and topology, and with ISPs and bandwidth.
Availability
There is a lot of theory about network reliability. If you are building a network that has to conform to a specific uptime figure, it's a good idea to spend some time reading up on this. You'll learn how to calculate availability figures for your network using known Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) values for your equipment. For a single component, the availability is just the MTBF as a percentage of the MTBF plus the MTTR:
\[ \text{availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \times 100\% \]
A component with a 100,000-hour MTBF (more than 11 years between failures) and an eight-hour MTTR has an availability of 99.992%, statistically. The easiest way to increase this figure is by buying equipment that runs longer before it fails (higher MTBF value) and takes less time to repair or be replaced by a spare (lower MTTR value).
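The figure in this example can be reproduced with a one-line calculation, taking availability as the MTBF divided by the sum of MTBF and MTTR. The function name is an illustrative choice.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time a single component is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# The component from the example: 100,000-hour MTBF, eight-hour MTTR.
example = availability(100_000, 8)
print(f"{example:.3%}")  # prints 99.992%
```

Halving the MTTR to four hours raises the result to roughly 99.996%, which shows why keeping spares on hand helps.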
Calculating availability becomes more complex as the number of components increases, but it basically boils down to this: as you add more components that must all be "up" for your network to function, your availability figure drops accordingly. To get the availability for the system as a whole, multiply the availability figures for all components. You can get around the decreasing availability as you add components by having two or more interchangeable components to perform a certain function, so that one of them can fail without impact. This is shown in the following figure. On the left, the figure shows three components, with the middle one decreasing the total availability significantly. On the right, the function of the middle component is performed by two components to get better availability.
Figure: Availability figures for different designs
In the design on the right, either the left or right component or both middle components have to fail before there is an outage. The two components with 90% availability together function as a single component with 99% availability.
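The two rules used here, multiplying availabilities for components in series and combining interchangeable components in parallel, can be sketched as below. The sketch assumes component failures are independent. The 99% figure used for the outer components is a made-up value, since only the 90% middle components are given in the text.

```python
from math import prod


def series(*availabilities):
    """All components must be up: multiply the individual figures."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one component suffices: the group fails only if all of them fail."""
    return 1 - prod(1 - a for a in availabilities)


# Two interchangeable 90% components act as one 99% component.
middle = parallel(0.90, 0.90)

# Hypothetical 99% components on either side of the middle function.
single_middle_design = series(0.99, 0.90, 0.99)
redundant_middle_design = series(0.99, middle, 0.99)
```

With these assumed values, the design with a single middle component comes out at about 88.2%, and the redundant design at about 97.0%.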
For connectivity to the Internet, availability calculations don't mean much, because you have to rely on suppliers (the power company, telcos, and ISPs) with largely unknown availability figures. And there are a lot of backhoes out there.
Selecting ISPs
Good ISPs are essential for a reliable connection to the Internet. Not so much because an ISP network should never fail: it shouldn't, but multihoming protects you against problems when this happens. What really matters is whether an ISP is reachable and willing to work with you when there is a problem. The traits that separate the good ISPs from the mediocre and bad are:
  • Knowledgeable staff
  • Willingness to accept unusual but reasonable BGP address announcements
  • Good reachability by email and phone. They should work with you when there is a (distributed) denial-of-service attack.
  • Good "Internet citizenship." A good ISP doesn't pollute the Net and discourages others from doing so. Specifically, it filters out packets with spoofed source addresses wherever possible.
When selecting a second ISP, make sure they don't depend on the same single upstream ISP that your first ISP does.
There are, of course, other minor details, such as an extensive network, good peering, sufficient bandwidth, and price. You may also want to consider the stability of the companies you do business with. Later in the book, there are more in-depth discussions on the cooperation you are likely to need from your ISPs.
Some ISPs offer Service Level Agreement (SLA) guarantees. This usually means you get some money back when the network performance or uptime isn't what it should be. But don't be too impressed with this: just because a network offers a guarantee, it doesn't mean it can actually deliver the guaranteed service level.
The best way to evaluate an ISP is by talking to their customers, especially a customer with needs similar to your own.
Bandwidth
Bandwidth is the amount of data that can traverse a communication channel per unit of time. On digital circuits, we measure bandwidth in bits per second (bps). When talking about computer memory, a kilobit is 1024 bits, but a "kilobit per second" (Kbps) is 1000 bits per second, or 125 bytes per second. Mbps is 1 million bits (122 KB) per second, and Gbps is 1 billion bits (119 MB) per second. Bandwidth doesn't say much about the actual speed of a connection: a 45-Mbps satellite channel has exactly the same bandwidth as an earthbound T3 circuit, but it takes the bits about 240 milliseconds to travel the 45,000 miles to the satellite and back to Earth. This could take as little as a few hundred microseconds for a short terrestrial T3 line. So, when you read "speed" you should think "bandwidth."
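The unit conversions and the satellite delay can be checked with a short Python sketch. It assumes signals travel at the speed of light in a vacuum, so the delay it computes is a lower bound; real circuits add equipment and queuing delay on top.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282  # in a vacuum; real links are slower

def bytes_per_second(bits_per_second: float) -> float:
    """Convert a line rate in bits per second to bytes per second."""
    return bits_per_second / 8

def propagation_delay_ms(distance_miles: float) -> float:
    """Minimum time for a bit to cover the distance, in milliseconds."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_S * 1000

kbps_in_bytes = bytes_per_second(1_000)               # bytes per second
mbps_in_kb = bytes_per_second(1_000_000) / 1024       # KB (1024 bytes) per second
gbps_in_mb = bytes_per_second(1_000_000_000) / 1024**2  # MB per second

# Up to a geostationary satellite and back down, as in the text.
satellite_ms = propagation_delay_ms(45_000)

print(f"1 Kbps = {kbps_in_bytes:.0f} bytes/s")
print(f"1 Mbps = {mbps_in_kb:.0f} KB/s")
print(f"1 Gbps = {gbps_in_mb:.0f} MB/s")
print(f"45,000 miles takes at least {satellite_ms:.0f} ms")
```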
Figuring out how much bandwidth to buy isn't usually too much of a problem when you're single-homed. If you aren't yet connected to the Net, anything will be a vast improvement, even if the connection isn't as fast as you'd like. When you are already connected, the decision to upgrade (or downgrade) your bandwidth or to keep things as they are is mostly a matter of weighing the number of complaints about speed against the costs. Determining the necessary bandwidth this way works well in many cases, but it's hardly scientific. More importantly, it won't work for a second line unless you are prepared to bring down the first line to evaluate the speed of the second.
To determine required bandwidth, it's necessary to translate a user experience into numbers. A user doesn't care how many milliseconds it takes a packet to travel from one end of the continent to another; she just wants web pages, email, and files to load quickly. Every one of these applications uses transactions: the user starts something, the remote server responds, and data is transferred. From a network-centric viewpoint, only the last stage of the transaction, the transfer of data, is of any interest, but a responsive DNS that doesn't delay the initial connection and a server that can handle user requests fast enough are also essential. It's important to get good values for both the number of transactions and their size. For web servers (and many other types of servers), this information is available in the log file. You can analyze the log file yourself, but a lot of software is also available to do this.
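As a rough illustration of analyzing a log file yourself, the following Python sketch counts transactions and bytes per hour. It assumes the server writes the Common Log Format (host, identity, user, timestamp, request, status, size); the function names and the sample lines are made up for this example, and a real analysis would also look at shorter peaks than a full hour.

```python
import re
from collections import defaultdict
from datetime import datetime

# Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(r'\S+ \S+ \S+ \[([^\]]+)\] "[^"]*" (\d{3}) (\d+|-)')

def hourly_traffic(lines):
    """Return {hour: (transactions, bytes_sent)} for the given log lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = CLF.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        hour = when.replace(minute=0, second=0)
        size = 0 if match.group(3) == "-" else int(match.group(3))
        totals[hour][0] += 1
        totals[hour][1] += size
    return {hour: tuple(counts) for hour, counts in totals.items()}

def average_bps(byte_count, seconds=3600):
    """Average bits per second needed to move byte_count in the interval."""
    return byte_count * 8 / seconds

sample_log = [
    '192.0.2.10 - - [10/Oct/2002:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.11 - - [10/Oct/2002:13:58:01 +0000] "GET /logo.gif HTTP/1.0" 200 4500',
    '192.0.2.10 - - [10/Oct/2002:14:02:10 +0000] "GET /missing HTTP/1.0" 404 -',
]

traffic = hourly_traffic(sample_log)
busiest_hour, (transactions, sent) = max(traffic.items(), key=lambda kv: kv[1][1])
print(busiest_hour, transactions, sent, f"{average_bps(sent):.1f} bps")
```

The hourly average understates what users experience during bursts, so treat the result as a floor for the bandwidth you need rather than a target.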
Router Hardware
When you have decided on the amount of bandwidth you need and the kind of connections to deliver this bandwidth, it's time to go router shopping. Routers share a lot of technology but little economy with regular computers. There are still routers sold that are based on early 1990s technology: it's expensive to design new models, so it's more cost-effective to keep building older models and sell them for a lower price. The most important differences between routers are:
Architecture
Smaller routers use a single CPU for both packet forwarding and route processing. Some routers use Symmetric Multiprocessing (SMP), in which several CPUs share the route-processing and forwarding functions. Another approach is to distribute these functions over a dedicated CPU for route processing and other CPUs or ASICs (special-purpose chips) that handle forwarding traffic from one or more interfaces each.
Redundancy features
Many routers have room for redundant power supplies. Some even have room for a second CPU board that kicks in when the first CPU fails due to hardware or software problems. Another important feature is hot swapping. This means that interface cards can be inserted and removed from the chassis without powering down or rebooting the router.
Interfaces
Some smaller routers have a fixed configuration. Most routers can handle a variety of interface cards, but it takes at least a mid-grade model to connect to higher bandwidth connections such as T3 or E3 and OC-3. Only high-end routers can handle Gigabit Ethernet or OC-12 and faster interfaces.
Forwarding speed
Even if high-speed interfaces are available for a router, it doesn't mean the router can fill up all the available bandwidth. This is especially true for LAN interfaces: Fast Ethernet can handle more than 100,000 packets a second in each direction, but the same isn't true for every router that has a Fast Ethernet interface.
Route processing and memory limitations
CPU and memory limitations are important considerations when running BGP with full routing, especially with full routing from multiple BGP peers. A distributed design with separate processors for forwarding and route processing is at an advantage when forwarding packets and processing route information at the same time, but single-processor routers are generally faster when the BGP table is first initialized and no packets are forwarded yet.
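The Fast Ethernet figure under "Forwarding speed" above follows from the minimum frame size. The sketch below assumes standard Ethernet framing: a 64-byte minimum frame plus 20 bytes of per-frame overhead on the wire (8 bytes of preamble and start delimiter, 12 bytes of inter-frame gap). The function name is chosen for this example.

```python
def max_packets_per_second(link_bps, frame_bytes=64, overhead_bytes=20):
    """Theoretical frame rate for back-to-back frames of one size.

    overhead_bytes covers the preamble/start delimiter (8 bytes) and the
    inter-frame gap (12 bytes) that every Ethernet frame occupies on the wire.
    """
    return link_bps / ((frame_bytes + overhead_bytes) * 8)

FAST_ETHERNET_BPS = 100_000_000

smallest_frames = max_packets_per_second(FAST_ETHERNET_BPS)
largest_frames = max_packets_per_second(FAST_ETHERNET_BPS, frame_bytes=1518)

print(f"64-byte frames:   {smallest_frames:,.0f} packets/s")
print(f"1518-byte frames: {largest_frames:,.0f} packets/s")
```

A router that keeps up with full-size frames can still fall far behind when the traffic consists mostly of small packets, which is why the packets-per-second rating matters more than the interface speed.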
Failure Risks
After buying some nice routing and switching hardware, you'll need a good place to put it. The best place would be a data-processing facility specially suited for housing computer and network hardware, with extensive protection against all kinds of hazards, such as fire, water, lightning, riots and civil unrest, tornadoes, earthquakes, and long-term power failure. This isn't always possible, however. If you have to house your equipment under less than ideal circumstances, at least take the time to think about water and power problems, and as always, use common sense.
Water is everywhere. Your equipment can be exposed to it when there is a leaking roof, a broken window, a clogged drain, or broken plumbing. And that's just the small stuff. Rivers and canals sometimes overflow, and if there is a small fire, there is usually more water damage than actual fire or smoke damage. I have personally dealt with water problems twice. Water damage to your equipment is a serious risk.
Water can quickly ruin the building infrastructure or the equipment itself by shorting electrical circuits or, more slowly, by inducing corrosion. Cables dropping down from overhead cable guides may lead the water directly into the equipment. Outlets close to the (raised) floor will short-circuit when there are only a few inches of water present. If possible, cables should be coming in from below a raised floor, and the outlets and connectors should be at least a few inches above floor level. Cabinets should be closed on top, but make sure the temperature inside doesn't get too high. Equipment should never be on the lowest floor of a building, especially not below ground level, because water from floods, leaks, and sprinkler systems will collect there.
In many parts of the world, the power grid is so reliable that installing a cheap uninterruptible power supply (UPS) may actually increase the likelihood of power failures to your equipment. The main reason to have a UPS, even if it supplies backup power only for a few minutes, is to allow PCs and servers to shut down properly so that open files and the filesystem aren't damaged. This isn't usually an issue for network equipment. After power is reapplied, the routers and switches boot and read configuration data from flash or nonvolatile memory, and they are up and running in half a minute or so. If all the network users will be down when there is a power outage, there is little advantage to having the network equipment connected to a UPS.
Building a Wide Area Network
Good redundancy and robustness start at home. Having several connections to the Internet won't do you much good if hosts in your network are unable to use them because of a problem in the internal network. I'm not going to bore you with a detailed description of how to run an in-house network. Sure, it can be hard at times, but at least the problems are limited in space, if not in time and complexity, and you're there to solve them. Don't underestimate the difference this makes. Keeping a geographically dispersed network running involves additional challenges.
When you start building a network spanning more than one location, there are many complications, but they all boil down to two things: you have to rely on other people, and everything takes much more time. An example: you make a mess of an access list and can no longer access a router. If this router is in another room in the same building, you can just walk there, connect a terminal to the console and fix the problem. Or you can use the oldest network administrator's trick in the book: reboot the router, so it returns to the configuration you saved before you started making configuration changes. Now imagine this router being 3,000 miles away. You don't want to fly out just because you made a mistake configuring a router. So you need someone in the remote facility to walk up to the router and reboot it. If the other location is a branch of your own company, this shouldn't be too much of a problem, but things are different if this is a colocation facility. Unless you're particularly fond of airline food, you need remote hands at the colocation facility. But you only want them to reboot your router on your request, and not when anyone else asks them to do this, so this introduces authentication and authorization issues.
Since you are now relying on the services of housing facilitators to colocate your equipment and on telco circuits to connect it all together, you are much more vulnerable to problems in other organizations. Telcos, large ISPs, and colocation facilitators are all capital-intensive businesses, and they depend on highly trained staff. This means that when the economy is booming, they all have a hard time expanding their services, because their suppliers can't deliver fast enough, and it gets hard to hire the necessary new staff. A dwindling economy paradoxically leads to a similar situation: cost cuts make it impossible to expand services as needed by customer demand (less money doesn't mean less network traffic), and there aren't enough knowledgeable employees because of layoffs.
Network Topology Design
There are many books on network design, and they often go into great detail about the different routing protocols, showing nice diagrams of routers and switches and how to connect them. They often fail to address with enough detail the most challenging part, however: coming up with a good topology. Maybe this is because even really bad topologies can work fairly well, if you spend enough time and energy on them. But choosing a better way to connect various parts of the network together will at least save you a lot of time and possibly money.
The first decision in building a network is usually not made on a conscious level: the decision involving which design model, philosophy, or methodology to use. Cisco literature heavily gravitates towards the Hierarchical Design Model illustrated in the figure below.
Figure: Hierarchical Design Model overview
In this design model, the network is divided into three parts: core/backbone, distribution, and access. The core network connects the different parts together with minimal overhead. The access parts of the network connect to the users, and the distribution layer sits between core and access to provide for access and quality of service policies. I'm not entirely comfortable with this way of classifying network components. It seems to me that the Hierarchical Design Model focuses primarily on the internals of large corporate or campus networks, without fitting smaller or ISP networks very well. In my view, each router interface belongs to a certain part of the network: external, core, or access. The backplane of the router is considered a part of the core. Switches shouldn't be considered here, because that would violate the separation between the datalink and network layers. Each type of interface has specific needs:
External interfaces
The filters on external interfaces exist to stop traffic that should never enter or leave the network, such as malformed or incorrectly addressed packets.
Access interfaces
The filters on access interfaces range from a simple filter to block packets with falsified source addresses, to the most stringent filters imaginable, depending on the needs of the systems this specific access interface connects to.
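The simplest access filter mentioned above, blocking packets with falsified source addresses, comes down to one membership test. The Python sketch below only illustrates the logic and is not router configuration; the function name and the 192.0.2.0/26 subnet (taken from the documentation address range) are assumptions made for the example.

```python
from ipaddress import ip_address, ip_network

def source_is_plausible(source, interface_subnets):
    """True if the source address belongs to a subnet behind the interface."""
    address = ip_address(source)
    return any(address in subnet for subnet in interface_subnets)

# Hypothetical access interface serving a single subnet.
access_subnets = [ip_network("192.0.2.0/26")]

legitimate = source_is_plausible("192.0.2.17", access_subnets)
spoofed = source_is_plausible("198.51.100.5", access_subnets)

print("192.0.2.17 accepted:", legitimate)
print("198.51.100.5 accepted:", spoofed)
```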
Chapter 4: IP Address Space and AS Numbers
All IP addresses aren't created equal. Which kind of IP addresses to announce over BGP is an important decision: it has a large impact on the reachability of your network, and it also has important financial consequences. The main decision is whether to keep using ISP-based addresses or to apply for an independent address range of your own. This poses an important question: where do IP addresses come from, if not from your ISP?
The Internet Assigned Numbers Authority (IANA) is responsible for assigning the protocol numbers used on the Internet. This includes IP addresses and AS numbers, but the IANA has delegated these activities to three Regional Internet Registries (RIRs):
APNIC: http://www.apnic.net
The Asia-Pacific Network Information Centre in Brisbane, Australia, serves most of Asia, Australia, and the Pacific.
ARIN: http://www.arin.net
The American Registry for Internet Numbers in Chantilly, Virginia, United States, serves North America.
RIPE NCC: http://www.ripe.net
The Réseaux IP Européens Network Coordination Centre in Amsterdam, The Netherlands, serves Europe, the Middle East, and the former Soviet Union countries.
Work is currently underway to establish two additional RIRs: one serving the region of Latin America, and the other serving continental Africa. For the time being, ARIN serves Latin America, the Caribbean, and Africa south of the Sahara, and the RIPE NCC serves Africa north of the Sahara.
In turn, the RIRs delegate responsibility for assigning IP address space to Local Internet Registries (LIRs) or directly to ISPs (which are also considered LIRs by RIPE).
When an ISP requests address space for the first time, the RIR allocates a relatively large block of address space, usually a /20. Then the RIR assigns a smaller range of addresses from this allocation to the ISP. The ISP now gets to announce the full allocation over BGP, but actual use is limited to the addresses that are actually assigned. If the ISP requests more address space for their own use or for customers, further assignments are made from the initial allocation. New blocks of address space are allocated to ISPs from which to draw further assignments if necessary. These allocations are called Provider Aggregatable (PA) address blocks, because an ISP can aggregate several address ranges assigned to customers into a larger range and so announce a relatively small number of routes (one per PA block) over BGP.
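The effect of aggregation can be sketched with Python's standard ipaddress module. The prefixes below are placeholders from private address space, used only because a /20 doesn't fit in the documentation ranges, and routes_to_announce is a name invented for this example.

```python
from ipaddress import collapse_addresses, ip_network

def routes_to_announce(pa_block, assignments):
    """One route for the PA block, plus routes for anything outside it."""
    outside = [net for net in assignments if not net.subnet_of(pa_block)]
    return [pa_block] + list(collapse_addresses(outside))

# Placeholder PA block and customer assignments (assumed values).
pa_block = ip_network("10.16.0.0/20")
customer_assignments = [
    ip_network("10.16.0.0/24"),
    ip_network("10.16.1.0/25"),
    ip_network("10.16.4.0/22"),
]

routes = routes_to_announce(pa_block, customer_assignments)
print(f"{len(customer_assignments)} assignments, {len(routes)} route(s):", routes)
```

However many customers receive assignments from the block, the rest of the Internet only sees the single route for the PA block.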
The Different Types of Address Space
The IP address assignment policies the RIRs use themselves and impose on LIRs and ISPs are based on RFC 2050, "Internet Registry IP Allocation Guidelines." These IP address assignment guidelines were developed in coordination with user communities, the Internet Engineering Steering Group (IESG), and the Internet Engineering Task Force (IETF). They have three main goals:
  • Conservation, to let the remaining IPv4 address space last as long as possible
  • Routability, to make sure that assigned addresses are reachable throughout the Internet
  • Registration, so every assignment is unique and to aid troubleshooting
Unfortunately, the conservation and routability objectives are at odds with each other. Conservation calls for assigning the smallest possible number of addresses, while routability is best served by keeping the total number of assigned address ranges to a minimum by assigning large blocks to avoid fragmentation of the IP address space. For regular single-homed organizations, there isn't much of a problem; ISPs can assign small address ranges from a PA block to such organizations, keeping the number of routes in the global routing table relatively small—one for each PA block. For this reason, it's hard to get IP addresses from a RIR directly for organizations that connect only to a single ISP: this unnecessarily breaks aggregation. Multihomed networks, on the other hand, can't get around this and have three options:
Request "Provider Independent" (PI) address space
PI space can't be aggregated by an ISP, so it must always be announced over BGP "as is." This was often done in the past, and it's still possible (but not easy) to get even small blocks of PI space at the time of this writing. It seems this is discouraged, and/or the policies on this may change, but don't take "no" for an answer too easily when requesting PI space.
Act as an ISP and request a PA block of their own
This is hard to do and expensive, but if you have enough ISP-like traits, it can be worth it, because this is the "highest quality" address space.
Request address space from an ISP's PA block but announce it as PI space
Requesting Address Space
If you have decided you want to use addresses from the PA space of one of your ISPs, you should discuss this with both ISPs to make sure they're on board with what you want to do. It's a good idea to get this in writing. Then you can proceed to request the addresses from your ISP.
PI addresses can be requested directly from a RIR, but in most cases, it's better to request them through one of your ISPs or at least consult with the ISP first. They'll have to forward your request to the RIR anyway, but involving an ISP will most likely save you time. Your ISP can make sure the request is in order before forwarding it, and the RIRs have more trust in an ISP they've worked with before than in someone they don't know. It may also save you some fees. Make sure your ISP understands you're talking about provider-independent address space, since this isn't all that common; few organizations qualify for a large enough block of PI space to avoid filtering.
If you want a PA block of your own, consult your RIR's web site.
As an end-user organization, you will be asked to provide a full list of subnets with the projected immediate and future use of each subnet when requesting IP address space. The following example shows how this would appear in the ARIN request form.
Example: A list of subnets as required for ARIN address requests
--------------------------------------------------------
Subnet#  Subnet Mask      Max  Now   1yr   Description
--------------------------------------------------------
1.0      255.255.255.192   64   36    49   Wired PCs
1.1      255.255.255.224   32   15    30   Wireless PCs
1.2      255.255.255.240   16    7    10   Web servers, DNS
1.3      255.255.255.248    8    8     8   Dial-up modems
1.4      255.255.255.248    8    2     2   Firewall DMZ
--------------------------------------------------------
Totals                    128   68    99
--------------------------------------------------------
The first number ("Subnet#") doesn't mean anything: it's just to keep the subnets apart in later discussions. Note that the "Max" number is the total number of addresses in the subnet, including the normally unusable first (network) and last (broadcast) address, but the "Now" and "1yr" numbers include only the number of addresses actually used for hosts and other systems that require an IP address, such as routers. When compiling the list, start with the largest subnet, so that all subnets automatically start on the proper bit boundaries. More information on (sub)netmask calculations can be found elsewhere in the book. The use of Variable Length Subnet Masking (VLSM) and subnet zero is mandatory, but this shouldn't be a problem for today's routers. There are also policies about giving each virtual web server its own IP address and giving dial-up, ADSL, and cable users fixed IP addresses. These policies boil down to something like, "Please use dynamic addresses, but if you insist on using static addresses, we'll assign them, for now."
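The advice to start with the largest subnet can be illustrated with a small Python sketch that lays out the five subnets from the example. The 192.0.2.0/25 block is a placeholder from the documentation address range, and plan_subnets is a helper written for this illustration; it assumes every size is a power of two that already includes the network and broadcast addresses, like the "Max" column.

```python
from ipaddress import ip_network

def plan_subnets(block, sizes):
    """Carve subnets out of block, largest first, so each one is aligned."""
    base = int(block.network_address)
    offset = 0
    plan = []
    for size in sorted(sizes, reverse=True):
        if size < 1 or size & (size - 1):
            raise ValueError(f"{size} is not a power of two")
        prefix_length = 32 - (size.bit_length() - 1)
        # ip_network rejects a subnet that does not start on its own boundary.
        plan.append(ip_network((base + offset, prefix_length)))
        offset += size
    if offset > block.num_addresses:
        raise ValueError("subnets do not fit in the block")
    return plan

# "Max" column from the example table, in any order.
sizes = [8, 64, 16, 32, 8]
plan = plan_subnets(ip_network("192.0.2.0/25"), sizes)

for subnet in plan:
    print(f"{subnet.network_address!s:15} {subnet.netmask!s:17} {subnet.num_addresses:3}")
```

Placing a small subnet first would push a later, larger subnet off its bit boundary, which is exactly what sorting by size avoids.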
Renumbering IP Addresses
Obviously, you need to renumber IP addresses when switching to a newly requested PI or PA block of your own. But renumbering is also advisable when using multiple address ranges from an ISP: it's better to announce a single, large block than to announce several small ones. Announcing as few routes as possible keeps the size of the global routing table down, which is good for everyone. Also, the larger your announcement, the less likely your route will be filtered, which is good for you. This means asking your ISP to exchange several small address ranges for a single larger one.
Renumbering is a lot of work, but there are some ways to make it as painless as possible to reconfigure the network equipment and deal with users and other departments who also have to make changes:
Make a plan
That way you don't forget anything. And if you update the plan when you find something unexpected, you'll have a much better plan if you ever have to renumber again. (Knock on wood.)
Use the new and old addresses side by side for a while
It's next to impossible to change everything at once, so you need some overlap between the moment the new addresses become available and the moment the old addresses are decommissioned.
Register the new DNS IP addresses as soon as possible
You need to register the IP addresses for your name servers with the RIR and/or your ISPs for the reverse mappings, and with InterNIC/NSI and all other registries where you have registered domain names, including the registries for any country TLD domains you have. This may take some time, so start this process as soon as your name servers are able to respond to queries sent to the new addresses. Also, all name servers that run as secondaries for your domains, or that are primaries for domains you are secondary for, need to be configured with the new addresses.
Make the translation from the old to the new numbers as simple as possible
It's much easier to remember and explain that 123.56.7.x becomes 213.75.6.x than arbitrary individual mappings from old addresses to new ones. You will probably make some subnets larger and others smaller, so there will be exceptions to the basic rule, but having a rule with a number of exceptions is still better than having no rule at all. For instance, if you have two Class Cs with about 100 addresses in use, only a few of which have a host address over 128, you might want to merge them into a single new
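A simple translation rule of the kind described above is easy to automate. The following Python sketch keeps the host part of each address and swaps the network part, with an optional table of exceptions; the function name, the exceptions parameter, and the exception entry are all hypothetical.

```python
from ipaddress import ip_address, ip_network

def renumber(address, old_block, new_block, exceptions=None):
    """Map an address in old_block to the same host number in new_block."""
    if exceptions and address in exceptions:
        return exceptions[address]  # explicit exceptions override the rule
    old_net, new_net = ip_network(old_block), ip_network(new_block)
    if old_net.prefixlen != new_net.prefixlen:
        raise ValueError("the simple rule needs blocks of equal size")
    old_address = ip_address(address)
    if old_address not in old_net:
        raise ValueError(f"{address} is not in {old_block}")
    host_part = int(old_address) - int(old_net.network_address)
    return str(new_net.network_address + host_part)

# The rule from the text: 123.56.7.x becomes 213.75.6.x.
mapped = renumber("123.56.7.42", "123.56.7.0/24", "213.75.6.0/24")

# A made-up exception for a host that moves to a different number.
moved = renumber("123.56.7.200", "123.56.7.0/24", "213.75.6.0/24",
                 exceptions={"123.56.7.200": "213.75.6.20"})

print(mapped, moved)
```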
The AS Number
The next step towards running BGP is requesting an AS number. The IANA has reserved the AS numbers from 64512 to 65535 for private use, similar to the 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 IP address ranges for private networks. Note that, unlike networks that use RFC 1918 address space, a network using a private AS number can still enjoy full connectivity to the entire Internet. The use of a private AS number isn't limited just to private networks but is also useful in cases where a network is fully connected to the Net, but the actual way in which this is accomplished doesn't have to be communicated throughout the world. For example, a company can have two connections to the same ISP and use BGP to route traffic over those connections in a fault-tolerant way. An AS number is needed to run BGP in this setup, but it can be a private one: the ISP can leave out the specific route to this customer, because this information is covered by an aggregate. Another example would be two companies that independ