Non-Stop Active Routing

Having a highly available routing node in the middle of your network running GRES with two REs is a great way to build a core infrastructure supporting five 9s network uptime. The problem is that ensuring a zero-loss environment requires GRES to be supplemented with GR protocol extensions on all running protocols on all surrounding routers. This is easily done in the core network, where you control platform selection, protocol support, and software version. However, that is not the case at the edge of the network, where you face customer-controlled devices. More often than not, you do not know the hardware type, software version, or protocol support of your customers' peering routers. Moreover, even if you do know, you cannot control their hardware and software selection. Thus, you cannot rely on GR protocol extensions to integrate seamlessly into your node redundancy design with GRES.

A provider also cannot always rely on network-based availability by means of redundant network paths toward the customer. Often, the customer is dual-homed to different service providers for better redundancy. Catastrophic events, such as a natural disaster, are unlikely to affect different service providers in the same manner. While one ISP might lose portions of its data center and upstream peerings, another ISP might be located far away from the disaster site and will be able to preserve all routing state with the rest of the Internet.

Additionally, you do not want your customers to have any idea that your router has gone through a failure event at all. You’d really prefer to provide as little negative information about your network to your customers as you possibly can.

All of these reasons create a strong case for a different implementation of high availability on edge devices. Non-Stop Active Routing (NSR) is the perfect solution, not only for the edge of the network, but also for the core. With its seamless work in the background, NSR should be used wherever applicable. Figure 4-6 shows a network design based on high availability tools.

Figure 4-6. Network design based on high availability tools

Implementation Details and Configs

NSR is a relatively new software feature available in JUNOS, included since the JUNOS 8.4 release on platforms with dual REs. The basic concept of NSR is that the backup RE can maintain all peering relationships with its neighbors during an RE switchover event, without the help of protocol extensions such as GR. While NSR takes care of routing protocols, forwarding redundancy is still built on the concept of GRES. Therefore, to provide minimal traffic loss, GRES must be supported and configured in addition to NSR.

The backup RE provides support for NSR by actively running an RPD process. The aim is that the RPD on the backup RE is initially in sync with the primary RPD and in sync with the rest of the network afterward. Figure 4-7 illustrates the NSR state replication process.

Figure 4-7. NSR state replication process

During the initial startup of the RPD process on the backup RE, all routing state from the primary RE is copied by means of rtsock messages using TCP. The private routing instance _private_instance is used as a means of RPD replication, ensuring that the RPD on the backup RE does not start routing from the null state. This initial copy also prevents delays in updates: without it, the backup RPD might have to wait a long time for all neighbors' states to be refreshed. Certain protocols never readvertise their routing information to neighbors unless specifically requested, which would leave the backup RE out of sync with the rest of the world.

Once the backup RPD is up and running and all of its routing information is populated in the relevant tables, it actively snoops on all incoming and outgoing protocol messages. It processes all incoming messages, and adds routes to or removes routes from the backup routing table as needed. During this process, the backup RPD resolves next hop information as needed, just as the primary RPD does. It also snoops all locally generated messages from the primary RE to its neighbors. In other words, the backup RE does not keep its state up-to-date with the primary RE. Rather, it keeps its state up-to-date with the rest of the network.

With NSR, the forwarding state and active kernel state are replicated the same way they are replicated in GRES: using the ksyncd daemon, which provides Non-Stop Forwarding (NSF) support.

To configure NSR support, use the following command:

[edit]
lab@r1# set routing-options nonstop-routing

As stated earlier, NSR is closely tied to GRES and uses GRES for NSF support. Thus, you must configure GRES as well:

[edit]
lab@r1# show chassis 
chassis {
    redundancy {
        routing-engine 0 master;
        routing-engine 1 backup;
        failover {
            on-loss-of-keepalives;
            on-disk-failure;
        }
        graceful-switchover;
    }
    routing-engine {
        on-disk-failure reboot;
    }
}
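To confirm that kernel and forwarding state replication is actually in place before relying on NSR, one option is the show system switchover operational command. This is only a sketch: the command is run on the backup RE, the prompt shown is illustrative, and the output is omitted because it varies by platform and release.

```
{backup}
lab@r1> show system switchover
```

The fields of interest are the ones that report whether graceful switchover is on and whether the configuration and kernel databases are ready.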

Because both REs must have the same configuration, you also need to enable commit synchronization. In earlier releases, omitting the commit synchronize statement would generate a reminder to use the commit synchronize command for all commits. To avoid forgetting to use this command, include the statement in the router configuration file; then, each time you execute a commit command, JUNOS executes commit synchronize:

{master}[edit]
lab@r1# set system commit ?   
Possible completions:
synchronize          Synchronize commit on both Routing Engines by default
{master}[edit]
lab@r1# set system commit synchronize 
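Pulling the pieces together, the following is a minimal sketch of one configuration session that enables GRES, NSR, and commit synchronization. The set commands for the chassis hierarchy are simply the set-style form of the stanza shown earlier; statement names are taken from that stanza and may differ between JUNOS releases, so treat them as assumptions to check against your software version.

```
[edit]
lab@r1# set chassis redundancy routing-engine 0 master
lab@r1# set chassis redundancy routing-engine 1 backup
lab@r1# set chassis redundancy failover on-loss-of-keepalives
lab@r1# set chassis redundancy failover on-disk-failure
lab@r1# set chassis redundancy graceful-switchover
lab@r1# set chassis routing-engine on-disk-failure reboot
lab@r1# set routing-options nonstop-routing
lab@r1# set system commit synchronize
lab@r1# commit
```

With the synchronize statement in place, the final commit is applied to both REs.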

To verify and troubleshoot NSR, use the following command:

{master}[edit]
lab@r1# run show task replication    
        Stateful Replication: Enabled
        RE mode: Master

    Protocol                Synchronization Status
    IS-IS                   Complete              

{master}[edit]
lab@r1# 

Also execute show route, show bgp neighbor, show ospf database, and similar commands on both the master and backup REs, and compare the output.
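One way to make that comparison is to run the same commands on the master, log in to the other RE, and run them again. The sketch below assumes the backup RE is reachable from the master's CLI; the prompts are illustrative, and output is omitted because it depends entirely on the network.

```
{master}
lab@r1> show task replication
lab@r1> show route summary

{master}
lab@r1> request routing-engine login other-routing-engine

{backup}
lab@r1> show task replication
lab@r1> show route summary
```

If the backup RPD is in sync, the route counts and protocol state on the two REs should match closely. Without NSR, routing commands on the backup RE typically fail because the RPD is not running there.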

The reason network engineers have jobs is that not everything works as expected. Sometimes the problem is a deficiency in the system; however, often the issue is that our expectations are not aligned with the intended design goal of the JUNOS feature.

Note

“Unlike great literature, routing protocol functionality is not subject to personal interpretation.” — Matthew Shaul, 2001

Because JUNOS supports traceoptions in many portions of the configuration, you might expect to find something similar for NSR support. In fact, you can turn on the nsr-synchronization flag under traceoptions in all major protocols:

[edit]
lab@r1# set protocols isis traceoptions flag nsr-synchronization detail
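To make the trace output easy to find, it can help to send it to a dedicated file and read it back with show log. The following is a sketch only; the file name isis-nsr-trace is a hypothetical choice, not something JUNOS requires.

```
[edit]
lab@r1# set protocols isis traceoptions file isis-nsr-trace
lab@r1# set protocols isis traceoptions flag nsr-synchronization detail
lab@r1# commit
lab@r1# run show log isis-nsr-trace
```

Once troubleshooting is finished, deleting the traceoptions stanza avoids unnecessary logging overhead.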
