No doubt you have experienced some sort of routing meltdown in your networking career. Unfortunately, it is a fact that failures are inevitable, whether software or hardware or a combination of the two. The challenge is to make them as painless as possible.
Let’s take a simple routing process failure as an example, in which a routing daemon restarts on one of the core routers. The restart causes a networkwide disruption to many of the network processes and brings down all protocol adjacencies. While the router is recovering, all its neighbors shift their traffic in different directions, using redundant links or paths. It is possible that the shifted traffic, now on oversubscribed links, causes congestion and potential traffic drops. When the failed router recovers, it establishes new adjacencies and advertises new routing information, which in turn causes another traffic shift back to the original paths. This is actually how basic routing handles failures; according to the protocol definitions, each step in this scenario happened as it was designed to. But how does this situation affect the users?
The traffic churn has a substantial impact on the user experience. This is especially true of the content delivered over modern next-generation networks, which is primarily video and voice communication, where any delay or jitter can cause havoc. If you have seen a distorted or frozen picture on your cable TV, it was likely caused by jitter. Moreover, if the next-generation telephone network is riding on top of an IP network, a 911 emergency call (or any other call) could be distorted just enough for the operator to misunderstand a street location, house number, or other important detail.
Because most modern routers can keep forwarding even when the control plane is incapacitated, the traffic churn resulting from a routing process hiccup is unnecessary. In certain scenarios, it might be better if the neighbors refuse to establish a new adjacency that shifts traffic to a new path after a neighboring router fails.
Graceful Restart (GR) allows you to address this scenario. In a nutshell, GR allows the failure of a neighboring router to go undetected for a period of time so that traffic continues to be forwarded along the already established paths, and no adjacencies are broken. This small detail helps in both software and hardware failures. The period of nondetection was originally designed to support software failures. However, with advances in hardware design and in the separation of the control and forwarding planes, this concept has found its real purpose as a technology complementing nonstop forwarding models. If all the router’s neighbors somehow ignore the loss of communication and continue forwarding toward the troubled router, which is still capable of forwarding, all problems induced by routing churn—including convergence time issues, delays, and jitter—simply go away. Figure 4-4 illustrates the basic concepts of GR.
While different protocols implement GR slightly differently, from a high availability point of view the concept is the same for all protocols. The rest of this chapter examines how the various protocols implement GR. Before we begin, however, it is important to point out two crucial conditions that are the prerequisites for successfully deploying GR:
The router must support Non-Stop Forwarding (NSF). During a failure, this router must be able to forward the traffic based on old forwarding entries found in its forwarding table.
No topology changes may occur during the failure event.
Topology changes can occur with certain network designs without any impact on traffic and quality of information. We cover the actual configuration steps for such scenarios later in this chapter.
To understand how GR is implemented in Open Shortest Path First (OSPF), let’s first analyze the basics of OSPF protocol communications. Neighbor adjacency between two OSPF speaking routers is formed by exchanging OSPF Hello messages. After the initial Hello messages, OSPF passes through several states and then establishes full neighbor adjacencies. OSPF advertises routing information using link-state update messages called link-state advertisements (LSAs). For normal routing, OSPF uses standard LSAs. With the integration of MPLS traffic engineering and GR into OSPF, opaque LSAs were created to carry information for these protocol extensions. Depending on the scope of advertisement, these LSA updates can be link-local, area-wide, or across the entire OSPF domain, which is LSA type 9, 10, or 11, respectively. Because GR involves communication between a router and its direct neighbors, it is implemented using link-local scope messages.
Grace (type 9) LSAs negotiate and exchange restart information between OSPF neighbors. The information relevant to the restarting event is carried in the body of the message using the type, length, value (TLV) system:
Type (two octets):
    1 (grace period)
    2 (restart reason)
    3 (interface IP address)
Length (two octets):
    Grace period (four octets)
    Restart reason (one octet)
    Interface IP address (four octets)
Here are more details about the information carried in the TLV message:
Grace period
    This value signals how long a helper should help the router in question. When this period expires, the helper brings down the adjacency with the restarting router, flushes its LSAs from the database, and floods new LSAs to the rest of the network informing the other routers that it has lost its adjacency to the neighbor.
Restart reason
    This is a required field. Four different values are defined:
    0 (unknown)
    1 (software restart)
    2 (software upgrade)
    3 (control plane switchover)
Interface IP address
    This is the IP address of the interface sending an update.
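The TLV layout described above can be sketched in a few lines of Python. This is only an illustration, not how JUNOS builds the message: the function names are hypothetical, and the padding of each value to a four-octet boundary is an assumption taken from RFC 3623 rather than from this chapter.

```python
import ipaddress
import struct

# TLV type codes as listed in the text above.
GRACE_PERIOD = 1
RESTART_REASON = 2
INTERFACE_ADDRESS = 3


def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """Pack one TLV: 2-octet type, 2-octet length, then the value.

    Assumption (RFC 3623): the value is zero-padded to a 4-octet
    boundary and the length field counts only the unpadded value.
    """
    padding = b"\x00" * (-len(value) % 4)
    return struct.pack("!HH", tlv_type, len(value)) + value + padding


def build_grace_lsa_body(grace_period_s: int, reason: int, interface_ip: str) -> bytes:
    """Concatenate the three TLVs that make up a Grace-LSA body."""
    return b"".join([
        encode_tlv(GRACE_PERIOD, struct.pack("!I", grace_period_s)),
        encode_tlv(RESTART_REASON, struct.pack("!B", reason)),
        encode_tlv(INTERFACE_ADDRESS, ipaddress.IPv4Address(interface_ip).packed),
    ])


def decode_tlvs(body: bytes) -> dict:
    """Walk the body and return {type: raw value bytes}."""
    tlvs, offset = {}, 0
    while offset < len(body):
        tlv_type, length = struct.unpack_from("!HH", body, offset)
        offset += 4
        tlvs[tlv_type] = body[offset:offset + length]
        offset += length + (-length % 4)  # skip the value and its padding
    return tlvs
```

For example, build_grace_lsa_body(120, 1, "192.168.24.1") describes a software restart with a 120-second grace period, and decode_tlvs recovers the three values from the resulting bytes.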
Possible helper
    By default, every router running JUNOS is in possible helper mode, without any configuration. If no network activity exists, all routers stay in this mode forever.
Helper
    Upon receiving a Grace-LSA from the neighboring router, a possible helper is promoted to the helper role. It maintains the adjacency, marks all of the neighbor's routes as stale, and continues to forward traffic toward the restarting router for a limited period of time. Once the neighbor finishes its restart event, or after the restart timer expires, the helper router flushes the old routes from its routing table. If the neighbor does not recover before the restart timer expires, the adjacencies are also brought down.
Restarting router
    The router that has undergone a restart event, or is about to undergo one, marks all its routes as stale, in the form of kernel routes, and sends a Grace-LSA to its neighbors requesting help until its control plane is fully recovered.
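The helper-side behavior described above can be pictured as a small state machine. The following Python sketch is a simplified model under stated assumptions: the class and method names are hypothetical, time is passed in explicitly as seconds, and only the transitions mentioned in the text are modeled.

```python
from enum import Enum


class Mode(Enum):
    POSSIBLE_HELPER = "possible helper"
    HELPER = "helper"


class Neighbor:
    """How one router treats a single OSPF neighbor during GR (illustrative)."""

    def __init__(self) -> None:
        self.mode = Mode.POSSIBLE_HELPER
        self.adjacency_up = True
        self.routes_stale = False
        self.grace_deadline = None

    def on_grace_lsa(self, now: float, grace_period: float) -> None:
        # Promotion to helper: keep the adjacency and mark routes stale.
        self.mode = Mode.HELPER
        self.routes_stale = True
        self.grace_deadline = now + grace_period

    def on_restart_complete(self) -> None:
        # Neighbor recovered in time: old routes are flushed, adjacency stays up.
        self._leave_helper()

    def on_tick(self, now: float) -> None:
        # Timer expired without recovery: the adjacency is brought down as well.
        if self.mode is Mode.HELPER and now >= self.grace_deadline:
            self.adjacency_up = False
            self._leave_helper()

    def _leave_helper(self) -> None:
        self.mode = Mode.POSSIBLE_HELPER
        self.routes_stale = False
        self.grace_deadline = None
```

The model keeps the two outcomes apart: a completed restart leaves the adjacency untouched, while an expired timer is the only path that brings it down.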
Let’s analyze two different types of failures and the steps OSPF takes to signal the failures. In the case of a software restart, OSPF sends a Grace-LSA prior to the restart event. This message ensures that all neighbors keep the adjacency in the “full” state until the router recovers its control plane. In the case of a hardware failure or a GRES event, OSPF sends a Grace-LSA after the control plane has recovered, but before it generates the first Hello message. This order of messages is critical. If the Hello message is received before the Grace-LSA, the original adjacency is destroyed and OSPF attempts to establish a new one.
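The ordering rule can be expressed as a short check. This Python sketch is a deliberately reduced model: the function name and the two event labels are hypothetical, and real OSPF processing involves far more state than a list of packet names.

```python
def adjacency_survives(events: list) -> bool:
    """Return True if the neighbor keeps the original adjacency.

    `events` is the ordered list of packets a neighbor receives from the
    restarting router after its control plane comes back, using the
    illustrative labels "grace-lsa" and "hello".
    """
    grace_seen = False
    for event in events:
        if event == "grace-lsa":
            grace_seen = True
        elif event == "hello" and not grace_seen:
            # A Hello from the restarted router arrives first: the old
            # adjacency is destroyed and a new one must be formed.
            return False
    return True
```

Calling adjacency_survives(["grace-lsa", "hello"]) models the intended order, while reversing the two events models the failure case described above.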
In either of these two failure cases, after an OSPF adjacency has been established, the restarting router requests help from its neighbors to rebuild its link-state database. All neighbors flood back the original LSAs advertised by the restarting router. After the restarting router receives all LSAs and has updated its LSA database, the GR event is complete and all kernel routes are removed from the routing and forwarding entries.
Because JUNOS supports GR for all major routing protocols, there is a single configuration statement that enables it for all of the protocols at once. However, in some network topologies, because of the different GR capabilities of particular routers, you might want to enable GR for some protocols and disable it for others. The following set of commands gives us the ability to do just that.
set routing-options graceful-restart
set protocols ospf graceful-restart ?
Possible completions:
  disable              Disable OSPF graceful restart capability
  helper-disable       Disable graceful restart helper capability
  notify-duration      Time to send all max-aged grace LSAs (1..3600 seconds)
  restart-duration     Time for all neighbors to become full (1..3600 seconds)
The disable statement disables GR for OSPF. As mentioned earlier, helper mode is enabled by default; use the helper-disable statement to disable this mode. The remaining two statements, notify-duration and restart-duration, allow you to modify the default values of OSPF GR timers.
There is one additional requirement for GR protocol extensions. Once the neighbors enter GR helper mode, any subsequent topology change forces them to terminate the helper state and bring down OSPF adjacencies. However, certain network designs can safely ignore the fact that the network topology has changed. Depending on how network traffic flows, flaps on out-of-the-way interfaces do not impact network traffic flowing through the failed router. In this scenario, you want to prevent GR from terminating and bringing down OSPF adjacencies. To work around unnecessary GR termination, configure OSPF to ignore topology changes and new LSA updates by configuring the no-strict-lsa-checking statement:
set protocols ospf graceful-restart no-strict-lsa-checking
To understand how GR is implemented in IS-IS, we’ll again start by analyzing the basics of IS-IS protocol communications. Neighbor adjacencies between two IS-IS routers are formed through the exchange of IS-IS Hello messages. After this, IS-IS goes through several states and eventually establishes full neighbor adjacencies. IS-IS advertises its routing information using link-state Protocol Data Units (PDUs), or LSPs. After routers establish adjacency, they send out Complete Sequence Number PDUs (CSNPs) containing a summary of the link-state information available for advertisement. The neighbors receiving the CSNPs check them against their own databases; if there is any discrepancy, a neighbor sends a Partial Sequence Number PDU (PSNP) requesting the specific subset of the information that is out of sync with its database. The response to the PSNP is the requested LSPs, which are in turn acknowledged with a PSNP on point-to-point links or implicitly through the periodic CSNP on broadcast links. Eventually, all routers in the same area synchronize their link-state databases.
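The CSNP comparison step can be sketched as follows. This is a minimal Python illustration, assuming each database is reduced to a mapping from LSP ID to sequence number; the function name and the sample LSP IDs are hypothetical.

```python
def lsps_to_request(local_db: dict, csnp_summary: dict) -> list:
    """Return the LSP IDs a router would ask for in a PSNP.

    An LSP is requested when the CSNP advertises it with a higher
    sequence number than the local copy, or when it is missing locally.
    """
    return sorted(
        lsp_id
        for lsp_id, sequence in csnp_summary.items()
        if local_db.get(lsp_id, -1) < sequence
    )
```

For instance, with a local database of {"R1.00-00": 5, "R2.00-00": 3} and a CSNP summary of {"R1.00-00": 5, "R2.00-00": 4, "R3.00-00": 1}, the function returns the two LSP IDs for R2 and R3, which are the entries out of sync.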
IS-IS provides GR support through use of the restart TLV, type code 211, which is carried in the Hello PDU. Two important bits in this message are the restart request (RR) bit and the restart acknowledgment (RA) bit. Under normal conditions, both bits are clear (set to 0). When a router restarts, it requests support from its neighbors by sending a type 211 TLV with the RR bit “on” (set to 1). Neighbors capable of helping acknowledge this PDU with a response that clears the RR bit, setting it back to 0, and sets the RA bit to 1. The response also contains the default hold-time value of 90 seconds. This is how long the neighbor will keep the adjacency up and keep all entries in the link-state database intact.
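The RR/RA exchange can be sketched in Python. Several details here are assumptions rather than statements from this chapter: the sketch uses type code 211, the value RFC 5306 assigns to the restart TLV, places RR and RA in the two low-order flag bits as in that RFC, and always encodes a two-octet remaining-time field. The function names are hypothetical.

```python
import struct

RESTART_TLV_TYPE = 211  # assumption: type code from RFC 5306
RR_BIT = 0x01           # restart request
RA_BIT = 0x02           # restart acknowledgment


def encode_restart_tlv(rr: bool, ra: bool, remaining_time_s: int) -> bytes:
    """Pack type, length, one flags octet, and a 2-octet remaining time."""
    flags = (RR_BIT if rr else 0) | (RA_BIT if ra else 0)
    value = struct.pack("!BH", flags, remaining_time_s)
    return struct.pack("!BB", RESTART_TLV_TYPE, len(value)) + value


def decode_restart_tlv(tlv: bytes) -> dict:
    tlv_type, length = struct.unpack_from("!BB", tlv, 0)
    if tlv_type != RESTART_TLV_TYPE or length < 3:
        raise ValueError("not a restart TLV with flags and remaining time")
    flags, remaining = struct.unpack_from("!BH", tlv, 2)
    return {
        "rr": bool(flags & RR_BIT),
        "ra": bool(flags & RA_BIT),
        "remaining_time_s": remaining,
    }


def helper_reply(request: bytes, hold_time_s: int = 90) -> bytes:
    """Answer an RR request: clear RR, set RA, advertise the hold time."""
    if not decode_restart_tlv(request)["rr"]:
        raise ValueError("RR bit not set; nothing to acknowledge")
    return encode_restart_tlv(rr=False, ra=True, remaining_time_s=hold_time_s)
```

A restarting router would send encode_restart_tlv(True, False, 0), and helper_reply on that request yields a TLV with RA set and the 90-second hold time mentioned in the text.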
Possible helper
    By default, every router running JUNOS is in possible helper mode, without any configuration. If no network activity exists, all routers stay in this mode forever.
Helper
    Upon receiving a restart TLV (type 211) from the neighboring router, a possible helper is promoted to the helper role. It maintains the adjacency, marks all of the neighbor's routes as stale, and continues to forward traffic toward the restarting router for a limited period of time. Once the neighbor finishes its restart event, or after the restart timer expires, the old routes are flushed from the routing table. If the neighbor does not recover before the restart timer expires, the adjacencies are also brought down.
Restarting router
    The router that has undergone, or is about to undergo, the restart event marks all of its routes as stale, in the form of kernel routes, and sends the restart TLV to its neighbors requesting help until its control plane is fully recovered.
As we mentioned in the section about OSPF and GR, you configure GR globally, for all routing protocols, in the routing-options hierarchy. In the IS-IS configuration hierarchy, you can disable it just for IS-IS and set GR timers specific to IS-IS:
graceful-restart ?
Possible completions:
  disable              Disable graceful restart
  helper-disable       Disable graceful restart helper capability
  restart-duration     Maximum time for graceful restart to finish (seconds)
Currently, IS-IS does not support an “ignore topology change” statement similar to that supported by OSPF.
The Internet is where we can see the real advantage of GR protocol extensions. At the time of this writing, the Internet routing table contains approximately 276,000 BGP routes, as shown in Figure 4-5 (and as graphed at http://bgp.potaroo.net). Consider a peering router at a major ISP peering point with three, four, or five copies of this Internet table. Now imagine a control plane going down on this router and then recovering within a minute or two. With incremental route update processing, even two copies of the table amount to about 552,000 routes (276,000 routes × 2) that need to be recalculated. Events such as this can have a drastic effect on the stability of the global Internet, but implementing GR in BGP can be a large component in helping to solve the instability problem.
As we did with OSPF and IS-IS, let’s start by looking at how BGP manages its peering sessions. A BGP neighbor session is established through BGP open messages and torn down with BGP notification messages. Routing information is advertised with BGP update messages. To finalize the routing update process, the peer sends an end-of-RIB (EOR) marker, which is a simple BGP update message with no prefixes. The session itself is maintained through BGP keepalives sent at periodic intervals; if no keepalives arrive for the duration of the hold time, the BGP session is torn down.
A capability announcement for the BGP GR is signaled using special bits. A restart bit (RS) signals that the router is going through the restart event, while a forwarding bit (FS) signals that the router is capable of retaining the forwarding state during a restart event. During initial GR negotiation, both bits are set to 0. In contrast to the OSPF implementation, the restarting router signals the restart event after the restart procedure has begun.
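The two bits can be made concrete with a short Python sketch of the GR capability as it appears in a BGP open message. The layout is an assumption drawn from RFC 4724, not from this chapter: capability code 64, a 4-bit flags field whose high bit corresponds to the restart bit, a 12-bit restart time, and one entry per address family whose high flag bit corresponds to the forwarding bit. The function names are hypothetical.

```python
import struct

GR_CAPABILITY_CODE = 64  # assumption: capability code from RFC 4724


def encode_gr_capability(restarting: bool, restart_time_s: int, families: list) -> bytes:
    """families is a list of (afi, safi, forwarding_preserved) tuples."""
    if not 0 <= restart_time_s <= 0x0FFF:
        raise ValueError("restart time must fit in 12 bits")
    # High bit of the first 16 bits is the restart bit; low 12 bits are the time.
    header = ((0x8 if restarting else 0) << 12) | restart_time_s
    value = struct.pack("!H", header)
    for afi, safi, forwarding in families:
        # High bit of the per-family flags octet is the forwarding bit.
        value += struct.pack("!HBB", afi, safi, 0x80 if forwarding else 0)
    return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value


def decode_gr_capability(cap: bytes) -> dict:
    code, length = struct.unpack_from("!BB", cap, 0)
    if code != GR_CAPABILITY_CODE:
        raise ValueError("not a graceful restart capability")
    (header,) = struct.unpack_from("!H", cap, 2)
    families = []
    for offset in range(4, 2 + length, 4):
        afi, safi, flags = struct.unpack_from("!HBB", cap, offset)
        families.append((afi, safi, bool(flags & 0x80)))
    return {
        "restarting": bool(header & 0x8000),
        "restart_time_s": header & 0x0FFF,
        "families": families,
    }
```

In this sketch, AFI 1 with SAFI 1 stands for IPv4 unicast, so encode_gr_capability(True, 120, [(1, 1, True)]) models a restarting router that preserved its IPv4 unicast forwarding state and asks for a 120-second restart time.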
Additionally, GR for BGP is negotiated for each BGP family. Different vendors, and even different JUNOS versions, may support a limited set of families. Therefore, the actual negotiation comes in handy. You can check what is being proposed and what is negotiated by displaying information about the BGP neighbor:
show bgp neighbor
Peer: 192.168.36.1+3098 AS 65010 Local: 192.168.24.1+179 AS 65010
  Type: Internal State: Established Flags: <>
  Last State: OpenConfirm Last Event: RecvKeepAlive
  Last Error: None
  Options: <Preference LocalAddress HoldTime GracefulRestart Refresh>
  Local Address: 192.168.24.1 Holdtime: 90 Preference: 170
  Number of flaps: 2
  Error: 'Cease' Sent: 2 Recv: 0
  Peer ID: 192.168.36.1 Local ID: 192.168.24.1 Active Holdtime: 90
  Keepalive Interval: 30
  NLRI for restart configured on peer: inet-unicast
  NLRI advertised by peer: inet-unicast
  NLRI for this session: inet-unicast
  Peer supports Refresh capability (2)
  Restart time configured on the peer: 120
  Stale routes from peer are kept for: 300
  Restart time requested by this peer: 120
  NLRI that peer supports restart for: inet-unicast
  NLRI peer can save forwarding state: inet-unicast
  NLRI that peer saved forwarding for: inet-unicast
  NLRI that restart is negotiated for: inet-unicast
  NLRI of received end-of-rib markers: inet-unicast
  NLRI of all end-of-rib markers sent: inet-unicast
  Table inet.0 Bit: 10000
To understand the BGP GR process, let’s examine the signaling involved in a single restart event.
Let’s assume that both GR and GRES are enabled on the restarting router. During the GRES event, all existing routing entries are saved in the router’s forwarding table. As soon as the RPD on the newly restarted control plane comes up, it sends an open message to all its neighbors with the RS and FS bits set to 1, requesting continuation of forwarding based on existing routing entries and requesting help in building a new routing table. The RPD marks all previous entries as stale routes and continues processing update messages received from its neighbors. After all the neighbors finish sending their routing update messages, they send end-of-RIB markers to mark the successful completion of the update process. The restarting router waits until IGP convergence is complete. Only then does it run the BGP route selection algorithm and activate the received routes. The active routes are pushed to the forwarding table on the PFE complex. Stale routes expire when the stale-routes timer runs out (300 seconds in the neighbor output shown earlier), after which they are flushed from the forwarding table. The presence of duplicate routes during this time is not an issue because the stale routes are marked with a preference of 255, and BGP routes have a default preference of 170.
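The reason duplicate routes are harmless can be shown with a tiny Python sketch. It assumes that the route with the numerically lowest preference becomes active, which is how JUNOS route preference works; the function name and the dictionary fields are hypothetical.

```python
def active_route(candidates: list) -> dict:
    """Pick the candidate with the lowest preference value."""
    return min(candidates, key=lambda route: route["preference"])


# Two copies of the same prefix during the restart window (illustrative data).
stale = {"prefix": "10.0.0.0/8", "preference": 255, "origin": "stale"}
fresh = {"prefix": "10.0.0.0/8", "preference": 170, "origin": "bgp"}
```

With both copies present, active_route([stale, fresh]) selects the freshly learned BGP route; the stale copy is used only while no better route exists.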
When a peer receives a new BGP open message with the RS and FS bits set to 1, it closes the TCP session belonging to the old BGP peering session. It then marks all the old BGP routes as stale and retains them until the stale-routes timer expires (300 seconds in the neighbor output shown earlier) while it continues forwarding based on the old routes. When the new BGP session is established and the end-of-RIB marker that follows the new update messages is received, the peer uses the new routes and flushes all the stale routes.
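The peer-side handling of stale routes can be sketched as a small Python class. This is a simplified model, not the RPD implementation: the class and method names are hypothetical, time is passed in explicitly as seconds, and the stale-routes timer is supplied by the caller.

```python
class PeerRib:
    """Routes learned from one BGP peer, as seen by the helping peer."""

    def __init__(self, stale_routes_time_s: float) -> None:
        self.routes = {}  # prefix -> {"next_hop": str, "stale": bool}
        self.stale_routes_time_s = stale_routes_time_s
        self.stale_deadline = None

    def learn(self, prefix: str, next_hop: str) -> None:
        # A new update replaces any stale copy of the same prefix.
        self.routes[prefix] = {"next_hop": next_hop, "stale": False}

    def on_peer_restart(self, now: float) -> None:
        # Keep forwarding on the old routes, but mark them all stale.
        for route in self.routes.values():
            route["stale"] = True
        self.stale_deadline = now + self.stale_routes_time_s

    def on_end_of_rib(self) -> None:
        # The peer finished re-advertising: anything still stale goes away.
        self._flush_stale()

    def on_tick(self, now: float) -> None:
        # The timer expired before an end-of-RIB marker arrived.
        if self.stale_deadline is not None and now >= self.stale_deadline:
            self._flush_stale()

    def _flush_stale(self) -> None:
        self.routes = {p: r for p, r in self.routes.items() if not r["stale"]}
        self.stale_deadline = None
```

Routes that the restarted router advertises again survive the flush, while routes it no longer advertises disappear either at the end-of-RIB marker or when the timer expires, whichever comes first.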
set protocols bgp graceful-restart ?
Possible completions:
  <[Enter]>            Execute this command
  disable              Disable graceful restart
  restart-time         Restart time used when negotiating with a peer (1..600)
  stale-routes-time    Maximum time for which stale routes are kept (1..600)
  |                    Pipe through a command
  NLRI for restart configured on peer: inet-unicast
  NLRI advertised by peer: inet-unicast
  NLRI for this session: inet-unicast
  Peer supports Refresh capability (2)
  Restart time configured on the peer: 120
  Stale routes from peer are kept for: 300
  Restart time requested by this peer: 120
  NLRI that peer supports restart for: inet-unicast
  NLRI peer can save forwarding state: inet-unicast
  NLRI that peer saved forwarding for: inet-unicast
  NLRI that restart is negotiated for: inet-unicast
  NLRI of received end-of-rib markers: inet-unicast
  NLRI of all end-of-rib markers sent: inet-unicast