Having a highly available routing node in the middle of your network running GRES with two REs is a great way to build a core infrastructure supporting five 9s network uptime. The problem is that ensuring a zero-loss environment requires GRES to be supplemented with GR protocol extensions for every running protocol on all surrounding routers. This is easily done in the core network, where you control platform selection, protocol support, and software versions. However, that is not the case at the edge of the network, where you face customer-controlled devices. More often than not, you do not know the hardware type, software version, or protocol support of your customers’ peering routers. Moreover, even if you do know, you cannot control their hardware and software selection. Thus, you cannot rely on GR protocol extensions to integrate seamlessly into your node redundancy design with GRES.
A provider also cannot always rely on network-based availability by means of redundant network paths toward the customer. Often, the customer is dual-homed to different service providers for better redundancy. It is most likely that catastrophic events, such as a natural disaster, will not affect different service providers in the same manner. While one ISP might lose portions of its data center and upstream peerings, another ISP might be located far away from the disaster site and will be able to preserve all routing state with the rest of the Internet.
Additionally, you do not want your customers to have any idea that your router has gone through a failure event at all. You’d really prefer to provide as little negative information about your network to your customers as you possibly can.
All of these reasons create a strong case for a different implementation of high availability on edge devices. Non-Stop Active Routing (NSR) is the perfect solution, not only for the edge of the network, but also for the core. With its seamless work in the background, NSR should be used wherever applicable. Figure 4-6 shows a network design based on high availability tools.
NSR is a relatively new software feature available in JUNOS, included since the JUNOS 8.4 release on platforms with dual REs. The basic concept of NSR is that the backup RE can maintain all peering relationships with its neighbors during an RE switchover event, without the help of protocol extensions such as GR. While NSR takes care of routing protocols, forwarding redundancy is still built on the concept of GRES. Therefore, to provide minimal traffic loss, GRES must be supported and configured in addition to NSR.
The backup RE provides support for NSR by actively running an RPD process. The aim is that the RPD on the backup RE is initially in sync with the primary RPD and in sync with the rest of the network afterward. Figure 4-7 illustrates the NSR state replication process.
During the initial startup of the RPD process on the backup RE, all routing state from the primary RE is copied by means of rtsock messages using TCP. The private routing instance _private_instance is used as a means of RPD replication, ensuring that the RPD on the backup RE does not start routing from the null state. This routing instance also prevents delays in updates, because the backup RPD would otherwise have to wait a long time for all neighbors’ states to be refreshed. Certain protocols never readvertise their routing information to neighbors unless specifically requested, so without this initial copy the backup RE would remain out of sync with the rest of the network.
Once the RPD is up and running and all of its routing information is populated in the relevant tables, it actively snoops on all incoming and outgoing protocol messages. Moreover, it processes all incoming messages, and adds routes to or removes routes from the backup routing table as needed. During this process, the RPD resolves next hop information as needed, just as the primary RPD does. The RPD also snoops all locally generated messages from the primary RE to its neighbors. Therefore, the backup RE does not keep its state up-to-date with the primary RE. Rather, it keeps its state up-to-date with the rest of the network.
With NSR, the forwarding state and active kernel state are replicated the same way they are replicated in GRES: using the ksyncd daemon, which provides Non-Stop Forwarding (NSF) support.
To configure NSR support, use the following command:
[edit]
lab@r1# set routing-options nonstop-routing
As stated earlier, NSR is closely tied to GRES and uses GRES for NSF support. Thus, you must configure GRES as well:
[edit]
lab@r1# show chassis
chassis {
    redundancy {
        routing-engine 0 master;
        routing-engine 1 backup;
        failover {
            on-loss-of-keepalives;
            on-disk-failure;
        }
        graceful-switchover;
    }
    routing-engine {
        on-disk-failure reboot;
    }
}
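If you are building this configuration from scratch rather than inspecting it, the same hierarchy can be entered as set commands. The following is a direct, statement-by-statement translation of the output above and adds nothing new. The lines beginning with ## are annotations for the reader, not part of the configuration.

```
[edit]
## Assign the master and backup roles to the two REs
lab@r1# set chassis redundancy routing-engine 0 master
lab@r1# set chassis redundancy routing-engine 1 backup
## Conditions that trigger an automatic failover
lab@r1# set chassis redundancy failover on-loss-of-keepalives
lab@r1# set chassis redundancy failover on-disk-failure
## Enable GRES itself
lab@r1# set chassis redundancy graceful-switchover
## Reboot an RE whose disk has failed
lab@r1# set chassis routing-engine on-disk-failure reboot
```

The exact statement names can vary between JUNOS releases, so check the context-sensitive help (?) on your software version before relying on this form.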
Because both REs must have the same configuration, you need to enable commit synchronization. In earlier releases, omitting the commit synchronize statement would generate a reminder to use the commit synchronize command for all commits. To avoid forgetting to use this command, include the statement in the router configuration; then, each time you execute a commit command, JUNOS executes the commit synchronize command:
{master}[edit]
lab@r1# set system commit ?
Possible completions:
  synchronize          Synchronize commit on both Routing Engines by default
{master}[edit]
lab@r1# set system commit synchronize
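For a single commit made before this statement is part of the active configuration, the explicit form of the command has the same effect. A minimal sketch, assuming both REs are up and reachable (the ## line is an annotation, not input):

```
{master}[edit]
## One-off synchronized commit: applies the candidate configuration on both REs
lab@r1# commit synchronize
```

Once the set system commit synchronize statement has been committed, a plain commit behaves this way by default.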
To verify and troubleshoot NSR, use the following command:
{master}[edit]
lab@r1# run show task replication
        Stateful Replication: Enabled
        RE mode: Master

    Protocol                Synchronization Status
    IS-IS                   Complete
{master}[edit]
lab@r1#
Also execute show route, show bgp neighbor, show ospf database, and similar commands on both REs, and compare the output.
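One possible way to run that comparison is sketched below. It assumes you can open a session to the backup RE from the master, and it assumes the backup RE shows the same hostname; your prompts may differ. Command output is omitted because it depends entirely on your network. Lines beginning with ## are annotations, not input.

```
{master}
## Open a session on the other (backup) RE
lab@r1> request routing-engine login other-routing-engine

{backup}
## Replication status as seen from the backup RE
lab@r1> show task replication

{backup}
## GRES readiness; this command is run on the backup RE
lab@r1> show system switchover

{backup}
## Compare with the same command on the master RE
lab@r1> show route summary
```

Once both REs report matching state, a manual switchover with request chassis routing-engine master switch, issued on the master RE during a maintenance window, is a reasonable end-to-end test: neighbors should stay up throughout.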
The reason network engineers have jobs is that not everything works as expected. Sometimes the problem is a deficiency in the system; however, often the issue is that our expectations are not aligned with the intended design goal of the JUNOS feature.
Note
“Unlike great literature, routing protocol functionality is not subject to personal interpretation.” — Matthew Shaul, 2001
Because JUNOS supports traceoptions in many portions of the configuration, you might expect to find something similar for NSR support. In fact, you can turn on the nsr-synchronization flag under traceoptions in all major protocols:
[edit]
lab@r1# set protocols isis traceoptions flag nsr-synchronization detail
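In practice you will usually want the trace output written to a dedicated file so you can read it back later. The sketch below is one way to do that. The file name isis-nsr is a hypothetical choice, the size and file count are arbitrary, and the BGP line assumes BGP is configured on the router. Lines beginning with ## are annotations, not input.

```
[edit]
## Write IS-IS trace output to a dedicated, size-limited file (name is hypothetical)
lab@r1# set protocols isis traceoptions file isis-nsr size 1m files 3
## The same flag can be set under other protocols, for example BGP
lab@r1# set protocols bgp traceoptions flag nsr-synchronization detail
lab@r1# commit
## Read the trace file back
lab@r1# run show log isis-nsr
```

Tracing with the detail option can be verbose, so consider removing the flag once troubleshooting is finished.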