Chapter 4. Network Operating System Choices

All models are wrong, but some are useful

George E. P. Box

All operating systems suck, but Linux just sucks less

Linus Torvalds

The breakup at the heart of cloud-native data center networking was engineered by the network operators dissatisfied with the degree of control the vendors were willing to cede. Whenever a breakup occurs, the possibilities seem limitless, and there’s a natural tendency to explore new ideas and ways of being that didn’t seem possible before. But after a while we realize that some of the old ways of being were actually fine. Just because we did not like watching yet another episode of X Files with pizza every Friday evening did not mean that pizza was the problem. We don’t need to throw out the things we like with the things we do not.

Thus it was that the evolution of modern big data and cloud-native applications gave rise to a whole new set of ideas about how to do networking. Some were born out of the necessity of the time, some were truly fundamental requirements, and some are still testing their mettle in the new world. In this chapter, we’ll follow the possibilities and models explored by the software half (the better half?) of network disaggregation.

The chapter aims to help the reader answer questions including:

  • What are the primary requirements of a cloud-native network operating system (NOS)?

  • What are Openflow and SDN? Where do they make sense and where do they not?

  • What are the possible choices for a NOS in a disaggregated box?

  • How do the models compare with the requirements of cloud-native NOS?

Requirements of a Network Device

In the spirit of Hobbes, the state of the NOS around 2010 could be summarized as: proprietary, manual, and embedded. The NOS was designed around the idea that network devices were appliances, and so the NOS had to function much like an embedded OS, including manual operation. What this meant was that the NOS stood in the way of everything that the cloud-native data center operators wanted. James Hamilton, a key architect of Amazon Web Services (AWS), wrote a well-read blog post titled “Datacenter Networks are in my Way” discussing these issues.

Operators buying equipment on a vast scale were extremely conscious of costs: both the cost of the equipment and the cost of operating it at scale. To control the cost of running the network, the operators wanted to change the operations model. Thus, a primary factor distinguishing a cloud-native networking infrastructure from earlier models is the focus on efficient operations; in other words, reducing the burden of managing network devices. The other distinguishing characteristic is the focus on agility: the speed of innovation, the speed of maintenance and upgrades, and the speed with which new services and devices can be rolled out. From these two needs arose the following requirements for any network device in the cloud-native era:

Programmability of the device

For efficient operations that can scale, automation is a key requirement. Automation is accomplished by making both device configuration and device monitoring programmable.

Ability to run third-party applications

This includes running configuration and monitoring agents that don’t rely on antiquated models provided by switch vendors. It also includes writing and running operator-defined scripts in modern programming languages such as Python (a sketch of such an agent follows this list). Some operators preferred to run these third-party apps as containers, primarily to isolate and contain a badly behaving third-party app.

Ability to replace vendor-supplied components with their equivalents

An example of this requirement is replacing the vendor-supplied routing suite with an open source version. This is a less common requirement for most network operators, but it is essential for many larger-scale operators who want to try out new ideas that may solve problems specific to their network. It is also something that researchers in academia would like. For a very long time, academia has had very little to contribute to the three lowest layers of the networking stack, including routing protocol innovations. The network was something you innovated around rather than with. The goal of this requirement was to crack open the lower layers to let parties other than the vendors make relevant contributions in networking. As an example of the innovation that can result, imagine the design, testing, and deployment of new path optimization algorithms or routing protocols.

Ability for the operator to fix bugs in the code

This allows the operators to adapt to their environment and needs as quickly as they want, not just as quickly as the vendor can.

In other words, the network device has to behave more like a server platform than the embedded box or specialized appliance that it had been for a long time.
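To make the “third-party application” requirement concrete, here is a minimal sketch of a monitoring agent. It is my own illustration, not taken from any NOS: it simply reads /proc/net/dev, which works on any NOS that exposes switch ports as ordinary Linux interfaces (the interface naming, such as swp1 or Ethernet1, varies by NOS and is an assumption here).

```python
#!/usr/bin/env python3
"""Minimal sketch of a third-party monitoring agent for a Linux-based NOS."""
import time

def read_counters():
    """Parse /proc/net/dev into a per-interface counter dictionary."""
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:          # skip the two header lines
            name, data = line.split(":", 1)
            fields = data.split()
            counters[name.strip()] = {
                "rx_bytes": int(fields[0]),
                "rx_errs": int(fields[2]),
                "tx_bytes": int(fields[8]),
                "tx_errs": int(fields[10]),
            }
    return counters

if __name__ == "__main__":
    while True:
        for ifname, stats in read_counters().items():
            if stats["rx_errs"] or stats["tx_errs"]:
                print(f"{ifname}: rx_errs={stats['rx_errs']} tx_errs={stats['tx_errs']}")
        time.sleep(10)
```

On a NOS that keeps network state in a vendor-specific user space store, the same agent would instead have to be rewritten against the vendor’s API, which is precisely the point of this requirement.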

The Rise of SDN and Openflow

The first salvo in the search for a better operating model came from the academia, who were battling a different problem. The networking research community faced a seemingly insurmountable problem: how to perform useful research with a vertically integrated switch made up of proprietary switching silicon and a NOS that was not designed to be a platform? Neither the network vendors nor the network operators would allow any arbitrary code to be run on the switch. So the academics came up with an answer that they hoped would overcome these problems. Their answer, expressed in the influential Openflow paper, was based on the following ideas:

  • Use the flow tables available on most packet switching silicon to determine packet processing. This would allow researchers to dictate novel packet forwarding behavior.

  • Allow the slicing of the flow table to allow operators to run production and research data on the same box. This would allow researchers to test out novel ideas on real networks without messing up the production traffic.

  • Define a new protocol to allow software running remotely to program these flow tables and exchange other pieces of information. This would allow researchers to remotely program the flow tables, eliminating the need to run software on the switch itself, thereby sidestepping the non-platform model of the then current switch NOSes.

Flow tables (usually called Access Control List tables in most switch literature) are lookup tables that use at least the following fields in their lookup key: source and destination IP addresses, the L4 protocol type (TCP, UDP, etc.), and the L4 source and destination ports (TCP/UDP ports). The result of the lookup can be one of the following actions (a sketch of such a lookup follows the list):

  • Forward the packet to a different destination than the one provided by IP routing or bridging

  • Drop the packet

  • Perform additional actions such as counting, Network Address Translation (NAT), and so on
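The following is a minimal sketch, not taken from the Openflow specification or any implementation, of what a flow-table lookup and its actions amount to conceptually. The FlowEntry fields and the wildcard convention are simplifications of my own.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FlowEntry:
    # None in any field acts as a wildcard
    src_ip: Optional[str]
    dst_ip: Optional[str]
    proto: Optional[str]        # "tcp", "udp", ...
    src_port: Optional[int]
    dst_port: Optional[int]
    action: Callable[[dict], None]

def lookup(flow_table, pkt):
    """Apply the action of the first matching entry; drop the packet on a miss."""
    for entry in flow_table:
        if all(getattr(entry, f) in (None, pkt[f])
               for f in ("src_ip", "dst_ip", "proto", "src_port", "dst_port")):
            entry.action(pkt)
            return
    print("no match: drop", pkt)

# Example: steer web traffic to port 2; count traffic from one host.
table = [
    FlowEntry(None, None, "tcp", None, 80,
              action=lambda p: print("forward to port 2", p)),
    FlowEntry("10.1.1.1", None, None, None, None,
              action=lambda p: print("count, then forward normally", p)),
]
lookup(table, {"src_ip": "10.2.2.2", "dst_ip": "10.1.1.9",
               "proto": "tcp", "src_port": 33000, "dst_port": 80})
```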

The node remotely programming the flow tables is called the controller. The controller runs software to determine how to program the flow tables, and it programs the Openflow nodes. For example, the controller could run a traditional routing protocol and set up the flow tables to use only the destination IP address from the packet, causing the individual Openflow node to function like a traditional router. But since this routing protocol is not a distributed application—an application running on multiple nodes with each node independently deciding the contents of the local routing table—the researchers could create new path forwarding algorithms without any support from the switch vendor. Moreover, the researchers could also try out completely new packet forwarding behavior very different from traditional IP routing.

This separation of data plane (the packet forwarding behavior) and control plane (the software determining the population of the tables governing packet forwarding) would allow researchers to make changes to the control plane from the comfort of a server OS. The NOS would merely manage resources local to the device. Based on the protocol commands sent by the control plane program, the NOS on the local device would update the flow tables.

This network model, having a central control plane and a distributed data plane, became known as software defined networking (SDN). The idea was to make the flow tables into a universal, software abstraction of the forwarding behavior built into switching silicon hardware. This was a new approach to breaking open the vendor-specific world of networking.

More Details About SDN and Openflow

Figure 4-1 shows the two central components of a switch with SDN. Figure 4-1(a) shows how Openflow manages a network run by devices with flow tables governing packet forwarding as shown in Figure 4-1(b).

Figure 4-1. Openflow and SDN

Figure 4-1(a) shows that configuration and monitoring can be done at a central place. The network operator has a god’s view of the entire network. The hope was that this model would also relieve network operators of trying to translate their network-wide desired behavior into individual box-specific configuration.

Figure 4-1(b) shows that the switching silicon can be quite simple with Openflow. The original flow tables consisted of only a handful of fields. To allow for interesting ideas, Openflow advocates wanted to make the flow table lookup key very generic: just include all the fields of the packet header up to a specific size—say the first 40 or 80 bytes. An example of this approach that was often cited was the automatic support for new packet headers, like new tunnel headers. We discuss tunnels in more detail in Chapter 6, but for the purpose of this discussion, assume that the tunnel header is a new packet header inserted before an existing packet header such as the IP header. Since the lookup involved all the bytes in the packet up to a certain length, the table lookup could obviously handle the presence of these tunnel headers, or other new packet headers, if programmed to recognize them.

The flow tables on traditional switching silicon were quite small compared to the routing or bridging tables; it was quite expensive to build silicon with very large flow tables. So the Openflow paper also suggested not pre-populating the flow tables, but using them like a cache. If there was no match for a packet in the flow table, the switch would send the packet to the controller. The central controller would consult its software tables to determine the appropriate behavior for the packet and also program the flow table so that subsequent packets for this flow would not be sent to the central controller.
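As a rough sketch of this reactive, cache-like model (my own simplification, not the Openflow protocol itself), the switch-side logic amounts to something like the following; controller_lookup() stands in for the round trip to the central controller.

```python
# Hypothetical reactive flow-cache logic on the switch side.
flow_cache = {}

def controller_lookup(five_tuple):
    # The controller consults its own software tables and returns an action,
    # such as the egress port. Here it is just a stub.
    return {"out_port": 2}

def handle_packet(five_tuple):
    action = flow_cache.get(five_tuple)
    if action is None:
        # Cache miss: punt to the controller, then install the result so that
        # later packets of the same flow are handled locally.
        action = controller_lookup(five_tuple)
        flow_cache[five_tuple] = action
    return action

print(handle_packet(("10.1.1.1", "10.2.2.2", "tcp", 33000, 80)))
```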

The Trouble with Openflow

Openflow ran into trouble right out of the gate. The simple flow table lookup model didn’t really work even for simple routing, because it didn’t handle ECMP (introduced in “Routing as the Fundamental Interconnect Model”). On the control plane side, sending the first packet of every new flow to a central control plane didn’t scale. These were both fundamental problems. In the networking world, the idea of sending the first packet of a flow to the control plane had already been explored in several commercial products and deemed a failure, because those products didn’t scale and their caches thrashed far more frequently than the developers originally anticipated.

The problems didn’t end there. Packet switching involves a lot more complexity than a simple flow table lookup. Multiple tables are looked up, and sometimes the same table is looked up multiple times, to determine packet switching behavior. How would you control whether to decrement a TTL, perform checksums on the IP header, handle network virtualization, and so on? Flow table invalidation was another problem: when should the controller decide to remove a flow? The answer was not always straightforward.

To handle all this and the fact that a simple flow table isn’t sufficient, Openflow 1.0 was updated to version 1.1, which was so complex that no packet switching silicon vendor knew how to support it. From that arose version 1.2, which was a little more pragmatic. The designers also revised the control plane to pre-populate the lookup tables rather than requiring the first packet of a flow to be sent to a controller. But most network operators had begun to lose interest by this point. As a data point, around 2013, when I started talking to customers as an employee of Cumulus Networks, Openflow and SDN were all I got asked about. A year and a half or two later, it rarely came up, and I didn’t see it come up with the operators I spoke to who were not Cumulus customers either. It is rarely deployed to program data center routers, except by one very large operator, Google.

Openflow didn’t succeed as much as its hype had indicated for at least the following reasons:

  • Openflow conflated multiple problems into one. The programmability that users desired was for constructs such as configuration and statistics. Openflow presented an entirely different abstraction and required reinventing too much to solve this problem.

  • The silicon implementation of flow tables couldn’t scale inexpensively compared to routing and bridging tables.

  • Data center operators didn’t care about research questions, such as whether to forward all packets with a TTL of 62 or 12 differently from other packets, or how to perform a lookup based on 32 bits of an IPv4 address combined with the lower 12 bits of a MAC address.

  • Openflow’s abstraction models were either too limiting or too loose. And like other attempts to model behavior, such as the Simple Network Management Protocol (SNMP), these attempts ended up with more vendor lock-in instead of less. Two different SDN controllers would program the Openflow tables very differently. Consequently, once you chose a vendor for one controller, you couldn’t really switch vendors easily.

  • Finally, people realized that Openflow presented a very different mindset from the classical networking model of routing and bridging, and therefore a lower maturity level. The Internet, the largest network in the world, already ran fairly well (though not without its share of problems). People did not see the benefit of throwing out what worked well (distributed routing protocols) along with the thing that did not (the inability to programmatically access the switch).

OVS

Open vSwitch (OVS) is a well-regarded open source implementation of the Openflow switch on Linux. ovs-vswitchd is the user space application that implements the Openflow switch, including the Openflow flow tables and actions. It does not use the lookup tables or packet forwarding logic of the Linux kernel, so ovs-vswitchd reimplements a lot of standard code such as ARP, IP, and MPLS. The flow tables in ovs-vswitchd are typically programmed by an Openflow controller. At the time of this writing, OVS was at version 2.11.90 and supports Openflow protocol version 1.5. The OVS website provides more details on all this. Use of OVS is the most widespread use of Openflow I’ve encountered.

OVS comes with a component called the OVSDB server, or just OVSDB, which is used to configure and query OVS itself. The Openflow controller can query the OVS state via OVSDB, using the schema exported by the OVSDB server for such queries. The OVSDB specification has been published as RFC 7047 by the Internet Engineering Task Force (IETF). The relationship between the Openflow controller, OVSDB, and OVS is shown in Figure 4-2.

Figure 4-2. OVS Components & Openflow Controller

OVSDB is also used by hypervisor and OpenStack vendors to configure OVS with information about the VMs or endpoints they control, and it is supported on hardware switches as well. For example, this latter approach is used to integrate SDN-like solutions that configure OVS on hypervisors with bare-metal servers that communicate with the VMs hosted by the hypervisors. In such a model, the switch NOS vendor supplies an OVSDB implementation that translates the configuration specified by OVSDB into its NOS equivalent and into the switching silicon state. VMware’s NSX is a prominent user of OVSDB. Switches from various prominent vendors, old and new, support such a version of OVSDB to get their switch certified for use with NSX.

OVS can be configured to use Intel’s Data Plane Development Kit (DPDK) to perform high-speed packet processing in user space.
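As an illustration of how these pieces are driven in practice, here is a sketch that configures a bridge through OVSDB (via ovs-vsctl) and installs a flow through Openflow (via ovs-ofctl). The commands shown are standard OVS utilities, but the bridge name, port names, and flow entry are made up for illustration, and the script assumes OVS is installed and that ports eth1 and eth2 exist.

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# OVSDB side: create a bridge and add ports to it (assumes eth1/eth2 exist).
run(["ovs-vsctl", "add-br", "br0"])
run(["ovs-vsctl", "add-port", "br0", "eth1"])
run(["ovs-vsctl", "add-port", "br0", "eth2"])

# Openflow side: steer TCP port 80 traffic destined to 10.1.1.2 out port 2,
# the kind of flow entry a controller would normally install remotely.
run(["ovs-ofctl", "add-flow", "br0",
     "priority=100,tcp,nw_dst=10.1.1.2,tp_dst=80,actions=output:2"])

# Inspect the resulting flow table.
run(["ovs-ofctl", "dump-flows", "br0"])
```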

The Effect of SDN and Openflow on Network Disaggregation

Openflow and the drums of SDN continue to beat. They are by no means moribund. In some domains such as Wide Area Networks (WAN), the introduction of Software Defined WAN (SD-WAN) has been quite successful because it did present a model that was simpler than the ones it replaced. Also, as a global traffic optimization model on the WAN, it was simpler and easier to use and deploy than the complicated RSVP-TE model. Others have come up with a BGP-based SDN. Here, BGP is used instead of Openflow as the mechanism to communicate path information for changing the path of certain flows, not every flow.

But within the data center, as a method to replace distributed routing protocols with a centralized control plane and a switch with Openflow-defined lookup tables, my observations indicate that it has not been successful. Outside of a handful of operators, I have rarely encountered this model for building data center network infrastructure.

The authors of Openflow weren’t trying to solve the problems of the cloud-native data center or of disaggregated networking. Their promise to provide something like a programmable network device intrigued many people and drew them down the path of Openflow.

But SDN muddied the waters of network disaggregation quite a bit. First, it made a lot of network operators think that all the skills they had acquired over many years via certifications and practice—skills in distributed routing protocols and bridging—were going to be obsolete. Next, traditional switch vendors used this idea of programming to say that network operators had to learn coding and become developers to operate SDN-based networks. This made many of them averse to even considering network disaggregation. And I know of other operators who spent a lot of time trying to run proof-of-concept (POC) experiments to verify the viability of Openflow and SDN in building a data center network, before giving up and running back into the warm embrace of traditional network vendors. Furthermore, many of the vendors who sprang up around the use of SDN to control data center network infrastructure supported only bridging, which, as we discussed, doesn’t scale.

NOS Design Models

If Openflow and SDN didn’t really take off as predicted, then what is the answer to the NOS in a disaggregated world? Strangely, the Openflow paper alluded to a different approach, but rejected it as impractical. From the paper:

One approach - that we do not take - is to persuade commercial “name-brand” equipment vendors to provide an open, programmable, virtualized platform on their switches and routers so that researchers can deploy new protocols.

Alas, the cliché of the road not taken—again. Let us examine the various NOS models in vogue. We start by looking at the parts that are common before examining the parts that are different.

The two common elements in all modern NOSes are:

Linux

This is the base OS for just about every NOS. By base OS, I mean the part that does process and memory management and non-networking device I/O, primarily storage.

Intel x86/ARM processors

A NOS needs a CPU to run on, and Intel and ARM are the most common CPU architectures for switches. ARM is the most common processor family on lower-end switches: those with 1GbE Ethernet ports or slower. The 10GbE and higher-speed switches that use Intel typically have dual-core processors with 2-8 GB of memory.

These two elements can be found in both the new disaggregated NOSes and the traditional data center switch vendors. The most well-known holdout from this combination is Juniper, whose Junos uses FreeBSD rather than Linux as its base OS. However, the next generation of Junos, called Junos OS Evolved, runs on Linux instead of FreeBSD.

An operating system acts as the moderator between resources and the applications that want to use those resources. For example, CPU is a resource, and access to it by different applications is controlled by process scheduling within the OS.

Every OS has a user space and a kernel space, distinguished primarily by protected memory access. Software in kernel space, except for device drivers, has complete access to all hardware and software resources. In the user space, each process has access only to its own private memory space; anything else it needs requires a request to the kernel. This separation is enforced by the CPU itself via protection rings or domains provided by the CPU. An operating system also provides a standard Application Programming Interface (API) to allow applications to request, use, and release resources.

In any switch, disaggregated or otherwise, the CPU complex is connected to the packet switching silicon, the management Ethernet port, storage, etc. as shown in Figure 4-3.

Figure 4-3. CPU and Packet Switching Silicon in a Switch Complex

The packet switching silicon is connected to the CPU via a peripheral input/output bus, usually PCIe (Peripheral Component Interconnect express). The interconnect bandwidth between the packet switching silicon and the CPU complex is only a fraction of the switching silicon capacity. For example, a common switching silicon available circa 2018 has 32 ports of 100GE, which adds up to 3.2 Terabits per second. However, the interconnect bandwidth varies between 400 Mbps and at most 100 Gbps, though it rarely hits the upper bound.

The main tasks of a NOS are to run control and management protocols, maintain additional state such as counters, and set up the packet switching silicon to forward packets. Except when Openflow is used, the NOS on a switch is not in the packet forwarding path. A small amount of traffic is destined to the switch itself, and that traffic is handled by the NOS. Packets destined to the switch are usually control plane packets, such as those sent by a routing or bridging protocol. The other kind of packet that the NOS sees is the erroneous packet, for which it has to generate error message packets.

Thus, the NOS on the switch sees a minuscule percentage of the total bandwidth supported by the packet switching silicon. For example, the NOS may process at most 150,000 packets per second, while the switching silicon will be processing over 5 billion packets per second. The NOS uses policers to rate limit traffic hitting it and prioritizes control protocol traffic over the others. This ensures that the NOS is never overwhelmed by packets sent to it.
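The text doesn’t specify how these policers are built, but policers are commonly implemented as token buckets. The following is a minimal sketch of that idea; the rate and burst values are chosen arbitrarily for illustration.

```python
import time

class TokenBucketPolicer:
    """Minimal token-bucket policer: allow up to `rate` packets per second,
    with bursts of up to `burst` packets; anything beyond that is dropped."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: a CPU-bound queue limited to 1,000 packets/sec with a 100-packet burst.
policer = TokenBucketPolicer(rate=1000, burst=100)
accepted = sum(policer.allow() for _ in range(10000))
print(f"accepted {accepted} of 10000 back-to-back packets")
```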

All the parts discussed so far are the same across all the NOSes designed since the turn of the century. Reliance on embedded or real-time OSes such as QNX and VxWorks is so last century, though a couple of vendors (see this NOS list maintained by OCP) still seem to rely on them.

Modern NOSes, however, differ in the following areas, each of which is examined in upcoming sections:

  • Location of the switch network state

  • Implementation of the switching silicon driver

  • Programming model

Each of these pieces affects the extent of control a network engineer or architect is able to exercise. The first two considerations are interconnected, but I have broken them up to highlight the effect of each on the network operator. Furthermore, a solution chosen for one of these considerations doesn’t necessarily dictate the answer to the others. We will conclude the comparison by looking at how each of the choices addresses the requirements presented in “Requirements of a Network Device”.

Location of Switch Network State

Switch network state represents everything associated with packet forwarding on the switch, from the lookup tables to access control lists to counters. This piece is important to understand because it defines to a large extent how third-party applications function in the NOS.

There are three primary models for this.

Vendor-Specific User Space Model

The most common model at the time of this writing stores the network state only in NOS-vendor specific software in the user space. In other words, the ultimate source of truth in the NOS in such a model is the vendor-specific software. In this model, the vendor-supplied control protocol stack writes directly to this vendor specific data store. Cisco’s NXOS and DPDK-based NOSes implement this model.

Hybrid Model

The second model, the hybrid model, is a variation of the first. The ultimate source of truth is the same as in the first model: the vendor-specific user space stack. However, the NOS also synchronizes parts of this state with the equivalent parts in the kernel. For example, in this model, the NOS synchronizes the routing table with the kernel routing table. Only a subset of the total state is synchronized. For example, I know of a NOS that synchronizes everything except bridging state. Another NOS synchronizes only the interfaces and the IP routing state, including the IP neighbor state; the rest is not synchronized with the kernel. In most cases, vendors also synchronize changes in the other direction: any changes made to the kernel in those structures are also reflected back into the vendor’s proprietary stack.

So is there a rationale behind which state is synchronized, and why those pieces and not the others? I have heard different reasons from different NOS vendors. Some NOS vendors treat the switching silicon as the only packet forwarder in the system. When the CPU sends a packet to the switching silicon, it expects that chip to behave very much as if the packet came in on a switching silicon port and to forward the packet in hardware; the same is assumed in the reverse direction. So the only state that is synchronized is whatever is necessary to preserve this illusion. In this way, the only packet forwarding model used is the switching silicon’s model. In other cases, the vendor’s user space blob is synchronized with the kernel state only to the extent that allows certain large network operators that use only routing functions to write their own applications on top of such a NOS. This is done primarily to retain (or acquire) the business of such large-scale network operators.

In either case, the vendor-provided applications, such as control protocol suites, update state directly in the vendor-specific user space part rather than in the kernel.

Arista’s EOS, Microsoft’s SONiC, IP Infusion’s OcNOS, Dell’s OpenSwitch, and many others follow this hybrid model, albeit with each differing in which parts are synchronized with the kernel.

Complete Kernel Model

The third model, one that is relatively new, is to use the Linux kernel’s data structures as the network state’s ultimate source of truth. The kernel today supports just about every critical construct relevant to the data center network infrastructure, such as IPv4, IPv6, network virtualization, and logical switches. Since packet forwarding is happening in the switching silicon, this model requires synchronizing counters and other state from the silicon with the equivalent state in the kernel. I’m assuming that in this model, the NOS vendor does not use any non-standard Linux API calls to perform any of the required functions. Cumulus Linux is the primary NOS vendor behind this model.

At some level, one can argue that most vendors follow the hybrid model, even someone like Cumulus, because not everything from the switching silicon is always available in the Linux kernel. However, the discussion here is about the general philosophy of the model rather than the state of an implementation at a particular point of time. The Linux kernel is sophisticated, mature, and evolving continuously through the work of a vibrant community.

Let us now consider what these models permit when you want to run a third-party application on the switch itself. The vendor-specific, user-space-only model requires modifying the application to use the NOS vendor’s API. Even simple, everyday applications such as ping and SSH need to be modified to work on such a NOS. The kernel model works with any application written to the standard Linux API. For example, researchers wanting to test out a new routing protocol can exploit this model. Furthermore, since the Linux API almost guarantees that changes will not break existing user-space applications (though they may need to be recompiled), the code is practically guaranteed to work on future versions of the kernel too. In the hybrid model, if the third-party application relies only on whatever is synchronized with the kernel (such as IP routing), it can use the standard Linux API; otherwise, it has to use the vendor-specific API.
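As a concrete illustration of what the kernel model buys you, here is a sketch of a third-party application that reads the kernel routing table over netlink. I use the pyroute2 library purely for convenience; that library choice is my own assumption and is not part of any NOS. On a NOS whose source of truth is the kernel, this output reflects the switch’s actual routing state; on a user-space-only NOS it would show only whatever the kernel happens to know.

```python
from socket import AF_INET
from pyroute2 import IPRoute   # assumes pyroute2 is installed (pip install pyroute2)

with IPRoute() as ipr:
    # Dump the kernel's IPv4 routing table over netlink, much as
    # `ip -4 route show` would.
    for route in ipr.get_routes(family=AF_INET):
        dst = route.get_attr('RTA_DST') or 'default'
        gateway = route.get_attr('RTA_GATEWAY')
        oif = route.get_attr('RTA_OIF')
        print(f"{dst}/{route['dst_len']} via {gateway} dev-index {oif}")
```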

Programming the Switching Silicon

Once the local network state is in the NOS, how is this state pushed into the switching silicon? Since the switching silicon is a device like an Ethernet NIC or disk, a device driver is obviously involved. The driver can be implemented as a traditional device driver—i.e., in the Linux kernel—or it can be implemented in user space.

At the time of this writing, the most common model puts the driver in user space. One of the primary reasons for this is that most switching silicon vendors provide the driver in user space only. The other reason is that until recently, the Linux kernel did not have an abstraction or model for packet switching silicon. The kernel provides many device abstractions such as block device, character device, net device, and so on. There was no such equivalent for packet switching silicon. Another reason for the prevalence of the user space model is that many consider it simpler to write a driver in user space than in the kernel.

How does the user space driver get the information to program the switching silicon? When the network state is in the user space, the user space blob communicates with the driver to program the relevant information into the silicon. When the ultimate source of truth is in the kernel, the user space driver opens a netlink socket and gets notifications about successful changes made to the networking state in the kernel. It then uses this information to program the switching silicon. These two models are shown in Figure 4-4.

Figure 4-4. How User Space Switching Silicon Driver Gets Information

The small light gray block just above the switch silicon represents a very low-level driver that’s responsible for interrupt management of the switching silicon, setting up DMA and so on.
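When the kernel is the ultimate source of truth, the “notifications about successful changes” mentioned above arrive over a netlink socket. Here is a minimal sketch of that subscription loop, again using pyroute2 as a convenience (my choice, not something any vendor SDK mandates). A real user space switching silicon driver would translate each event into silicon programming rather than print it.

```python
from pyroute2 import IPRoute   # assumes pyroute2 is installed

ipr = IPRoute()
# Subscribe to kernel broadcast notifications (route, link, address changes),
# the same mechanism a user space switching silicon driver would use.
ipr.bind()

while True:
    for msg in ipr.get():        # blocks until the kernel sends notifications
        event = msg['event']     # e.g., RTM_NEWROUTE, RTM_DELROUTE, RTM_NEWLINK
        if event in ('RTM_NEWROUTE', 'RTM_DELROUTE'):
            dst = msg.get_attr('RTA_DST') or 'default'
            print(f"{event}: {dst}/{msg['dst_len']} -> program the switching silicon here")
```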

SAI

Since the silicon switch driver is in the user space, most NOS vendors have defined their own Hardware Abstraction Layer (HAL) and a silicon-specific part that uses the silicon vendor’s driver. This has enabled them to support multiple types of switching silicon by mapping logical operations into silicon-specific code. The HAL definition is specific to each NOS vendor.

Microsoft and Dell defined a generic, NOS-agnostic HAL for packet switching silicon and called it Switch Abstraction Interface (SAI). They defined as much as they needed from a switching silicon to power Microsoft’s Azure data centers and donated the work to Open Compute Project (OCP). OCP adopted this as a subproject under the networking umbrella. Microsoft requires that any switching silicon support SAI for it to be considered in Microsoft Azure. Therefore, many merchant silicon vendors announced support for SAI; i.e., they provided a mapping specific to their switching silicon for the abstractions supported by SAI. As originally defined, SAI supported basic routing and bridging and little else. Development of SAI has continued since then and is an open source project available on GitHub.

SAI makes no assumptions about the rest of the issues we defined at the start of this section. In other words, SAI does not address the issues associated with supporting third-party applications on a NOS. SAI neither assumes that the packet forwarding behavior is the switching silicon’s nor does it assume that the Linux kernel is the ultimate source of truth.

Switchdev

To counter the lack of a kernel abstraction for packet switching silicon, Cumulus Networks and Mellanox along with some other kernel developers started a new device abstraction model in the Linux kernel. This is called switchdev.

Switchdev works as follows:

  • The switchdev driver for a switching silicon communicates with the switching silicon and determines the number of switching ports the silicon is configured for. The switchdev driver then instantiates as many netdev devices as there are configured silicon ports. These netdev devices then take care of offloading the actual packet I/O to the switching silicon. netdev is the Linux model for Ethernet interfaces and is what NICs use today. By using the same kernel abstraction, and adding hooks to allow offloading of forwarding table state, ACL state, interface state, and so on, all the existing Linux network commands, such as ethtool and the iproute2 commands, automatically work with ports from packet switching silicon.

  • Every packet forwarding data structure in the kernel provides well-defined back-end hooks that are used to invoke the switching silicon driver to offload the state from the kernel to the switching silicon. Switching silicon drivers can register functions to be called via these hooks. For example, when a route is added, such a hook calls the Mellanox switchdev driver, which then determines whether this route needs to be offloaded to the switching silicon. Routes that do not involve the switching silicon are typically not offloaded.

So when the kernel stores all the local network state, programming the switching silicon is more efficient using switchdev.
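Because switchdev exposes front-panel ports as ordinary netdevs, the ordinary tooling works unchanged. Here is a small sketch that polls a switch port with the standard iproute2 and ethtool commands; the port name swp1 is an assumption matching Cumulus-style naming and will differ on other NOSes.

```python
import subprocess

PORT = "swp1"   # a switchdev-backed front-panel port; the name varies by NOS

# Standard iproute2: link state and kernel-maintained counters.
print(subprocess.run(["ip", "-s", "link", "show", "dev", PORT],
                     capture_output=True, text=True).stdout)

# Standard ethtool: driver-level statistics, which a switchdev driver
# populates from the switching silicon's hardware counters.
print(subprocess.run(["ethtool", "-S", PORT],
                     capture_output=True, text=True).stdout)
```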

Switchdev first made an appearance in kernel release 3.19. Mellanox provides full switchdev support for its Spectrum chipset.

Application Programming Interface

When the kernel holds the local network state in standard structures, the standard Linux kernel API is the programming API; it is not vendor specific and is governed by the Linux kernel community. When the user space holds the local network state, the NOS vendor’s API is the only way to reliably access all network state. This is why operators demand vendor-agnostic APIs. Despite the Linux kernel API already being vendor agnostic, the kernel model is considered just another vendor model by operators. Thus, some operators demand that the vendor-agnostic API be provided even on native Linux.

These so-called vendor-agnostic APIs are a continued throwback to the position of network devices largely as appliances rather than platforms. So the operators continue the pursuit of a uniform data model for network devices via vendor-agnostic APIs. But no sooner have the operators defined an API than a new feature added by a vendor causes a vendor-specific extension to the API. This happened a lot, for example, with the Management Information Base (MIB) data structures consulted by SNMP. It happened with Openflow. It’s happening with Netconf. I am reminded of the Peanuts football gag, in which Lucy keeps egging Charlie Brown on to kick the ball, only to pull it away at the last minute. I fear operators are treated similarly when their vendor-agnostic models are promptly polluted by vendor-specific extensions.

The Reasons Behind the Different Answers

The different models of where network state is stored and the differences in switch silicon programming have two primary causes, in my opinion:

  • The time period when they were first conceived and developed, and the network features of the Linux kernel during that period

  • The business model and its relation to licensing

Timing is relevant because of the evolution of the Linux kernel. At the time of this writing, Linux kernel version 5.0 has been released. Vendors framed their designs around the state of the Linux kernel at the time they decided to adopt it. As such, they miss out as newer versions of Linux provide more features, better stability, and higher performance. The timing of each model, and its relation to the kernel, is as follows:

  1. The first model was developed around 2005. The Linux kernel version was roughly 2.4.29. Cisco’s NXOS had its roots in the OS that was built for its MDS Fibre Channel Switch family. NXOS first started using the Linux kernel around 2002, when Linux 2.2 was the most stable version. Furthermore, NXOS froze its version of the Linux kernel in the 2.6 series because it did not really need much from the kernel.

  2. Arista, which was the primary network vendor pushing the hybrid model, started operations around 2004. Although I do not know the origins of their Linux kernel model, they largely stayed with Linux kernel version 2.6 (Fedora release 12, from what I gather) for a long time. The latest versions of their OS at the time of this writing run Linux kernel version 4.9.

  3. Cumulus Networks, which heralded the beginning of the third NOS model, uses Debian as its base distribution. This means that whatever kernel version ships in the Debian release they picked is largely the kernel version they have. At the time of this writing, Cumulus Linux is based on kernel version 4.1, with significant backports of useful features from newer Linux kernels. In other words, while the kernel version is nominally 4.1, it contains fixes and advanced features picked from newer Linux kernels. By staying with the base 4.1 kernel, software from the base Debian distribution works as is.

Licensing considerations were another critical reason why the different NOS models developed and persisted. The Linux kernel is under the GNU General Public License (GPL) version 2. So if a NOS vendor patches the kernel, such as by improving the kernel’s TCP/IP stack, they must release the changes back to the community if they distribute that kernel. In 2002, when the Linux kernel was still relatively nascent, NOS vendor engineers felt that they did not want to help competitors with improvements to the kernel networking stack. So they switched to designing and developing a user space networking stack. Kernel modules can be proprietary blobs, although the Linux community heavily discourages their use. This idea continued with later NOS vendors, who developed new features (consider VRF, VXLAN, and MPLS) in user space, in their own private blob.

Finally, working in an open source community involves a different dynamic from the ones for-profit enterprises are used to. First, finding talented and knowledgeable kernel developers is hard. Next, unless organizations are sold on the benefits of developing open source software, they are loath to develop something in the open that they fear potentially enables competitors and reduces their own value proposition. NOS vendors such as Cumulus Networks built their business on developing all network layer features exclusively in the Linux kernel, and so working closely with the kernel community was part of their DNA.

User Interface

Network operators are used to a command line that is specific to a networking device. The CLI that networking devices provide is not a programmable shell, unlike Linux. It is often a modal CLI; that is, most commands are valid only within a specific context of the CLI. For example, neighbor 1.1.1.1 remote-as 65000 is valid only within the context of configuring BGP. You enter this context by typing several prior commands, such as “configure” and “router bgp 64001”. Linux commands are non-modal and programmable; they’re meant to be scripted. Network operators are used to typing “?” rather than <TAB> to get command completions, and are not used to typing “--” or “-” to specify options to a command, and so on. What this means is that when dropped into a bash shell, most network operators are too paralyzed to do anything. They type “?” and nothing happens. They hit <TAB> and most likely nothing happens. A network operator not used to the bash shell, or more importantly used to the traditional network device CLI, will be at a loss on how to proceed.

There’s a popular saying: “UNIX is user-friendly. It just is very selective about who its friends are.” Just because the Linux kernel makes for a powerful and sophisticated network OS doesn’t automatically mean that bash can be the only CLI to interact with networking on a Linux platform. Linux grew up as a host OS and, as such, does not automatically make things easy from a user interface perspective for those who want to use it as a router or a bridge. Furthermore, each routing suite has its own CLI. FRR, the routing suite we use in this book, follows a CLI model familiar to Cisco users. BIRD, another popular open source routing suite, uses a Juniper Junos-like syntax. GoBGP uses a CLI more familiar to people used to Linux and other modern Go-based applications.

Therefore, there is no single, unified, open source way to interface with networking commands under Linux. There are open source packages put out by Cumulus, such as ifupdown2, that work with Debian-based systems such as Ubuntu, but they don’t work with Red Hat-based systems. FRR seems like the best bet for a unified, homogeneous CLI. It has an active, vibrant community, and people are adding new features to it all the time. If you’re configuring a pure routing network, FRR can be your single unified networking CLI. If you’re doing network virtualization, you have to use native Linux commands or ifupdown2 to configure VXLAN and bond (port channel) interfaces, though there are efforts underway to add this support to FRR. Pure bridging support such as MSTP will most likely remain outside the purview of FRR. FRR also has some basic support for Netconf, but it has no REST API.

To summarize, using the Linux kernel as the native network OS doesn’t mean that bash shell with the existing command set defined by iproute2 is the only choice. FRR seems the best alternative, though interface configuration is limited to routed interface configuration at the time of this writing.
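Scriptability is the practical consequence of all this. The sketch below drives FRR’s vtysh and iproute2 non-interactively from Python. The vtysh -c mechanism and the ip commands are standard, but the interface output, AS numbers, and neighbor address are illustrative assumptions, and the script assumes FRR is installed with bgpd enabled.

```python
import subprocess

def sh(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Inspect interface and address state with plain iproute2 (non-modal, scriptable).
print(sh(["ip", "-br", "addr", "show"]))

# Drive FRR's modal CLI non-interactively: each -c flag is one command,
# executed in sequence, so the BGP context from the prose can be scripted too.
sh(["vtysh",
    "-c", "configure terminal",
    "-c", "router bgp 64001",
    "-c", "neighbor 1.1.1.1 remote-as 65000"])

print(sh(["vtysh", "-c", "show bgp summary"]))
```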

Comparing the NOS Models with Cloud-Native NOS Requirements

Let’s compare the three models of NOS with regard to the requirements posed by the cloud-native NOS. In the cases below, I’m discussing the requirement as it applies to running on the switch itself.

Table 4-1. Comparing the NOS Models
Requirement                            | User Space Model   | Hybrid Model              | Kernel Model              | Kernel Model with Switchdev
Can be programmable                    | Proprietary API    | Proprietary API           | Mostly open-source        | Open-source
Can add new monitoring agent           | NOS-supported only | Mostly NOS-supported only | Mostly yes                | Mostly yes
Can run off-the-shelf distributed app  | NOS-supported only | Mostly NOS-supported only | Yes                       | Yes
Can replace routing protocol suite     | No                 | No                        | Yes                       | Yes
Can patch software by operator         | No                 | No                        | On open-source components | On open-source components
Support new switching silicon          | NOS vendor support | NOS vendor support        | NOS vendor support        | Switching silicon support

Illustrating the Models with an Example

Let us dive a bit deeper into how the hybrid and kernel models work when a third-party application is involved. We can ignore the pure user space model here, because in it every networking application must be provided by the vendor and runs in user space without kernel involvement. In the following sections, I’ll describe applications on Arista’s EOS, even though EOS is not a disaggregated NOS; other disaggregated NOSes behave in a somewhat similar fashion. I use EOS primarily because it is the one I’m familiar with, though not as much as I am with Cumulus.

Ping

Let us see what happens when the ping application is used to see whether an IP address is reachable. We’ll assume that we’re trying to ping 10.1.1.2, which is reachable via switch port 2. In both models, we assume the kernel’s routing table contains this information.

First, in both the kernel and the hybrid models, the switching silicon ports are represented in the Linux kernel as regular Ethernet ports, except they’re named differently. Cumulus names these ports swp1, swp2 etc., while Arista names them Ethernet1, Ethernet2, etc.

In either model, some implementations may choose to implement a proprietary kernel driver that hooks behind the netdevs that represent the switching silicon ports. Other implementations may choose to use the kernel’s Tun/Tap driver (described later in “Tun/Tap Devices”) to boomerang the packets back to the NOS vendor’s user space component, which handles the forwarding. In the case of Cumulus, the proprietary driver is provided by the switching silicon vendor whose silicon is in use on the switch. In the case of Arista, the proprietary driver is provided by Arista itself.

The switching silicon vendor defines an internal packet header for packet communication between the CPU complex and the switching silicon. This internal header specifies such things as which port the packet is received from or needs to be sent out of, any additional packet processing that is required of the switching silicon, and so on. This internal header is used only between the switching silicon and the CPU complex and never makes it out of the switch.

When the kernel driver is responsible for sending the packet to the switching silicon, issuing a ping command results in the following sequence of actions:

  1. The application, ping, opens a socket to send a packet to 10.1.1.2. This is just as it would on a server running Linux.

  2. The kernel’s routing tables point to swp2 as the next hop. The switching silicon’s tx/rx driver delivers the packet to the switching silicon, for delivery to the next hop router.

  3. When the ping reply arrives, the switching silicon delivers the packet to the kernel. The switching silicon’s tx/rx driver receives the packet and the kernel processes the packet as if it came in on a NIC.

  4. The kernel determines the socket to deliver the packet to, and the ping application receives it.

This packet flow is shown in Figure 4-5. As shown, the user space switching silicon driver is not involved in this processing.

Figure 4-5. Native Kernel Packet Tx/Rx Flow

While the same sequence of steps occurs on an Arista switch too, the Arista switch requests the switching silicon to treat the packet as if it came in on one of its ports and to forward it appropriately. This request is indicated via the internal packet header mentioned earlier. In other words, the switching silicon redoes the packet forwarding even though the kernel has already done the right thing. This allows Arista to claim that all packets, locally originated or otherwise, are forwarded the same way: via the switching silicon. In other words, the packet forwarding behavior is that of the silicon, not of the Linux kernel.

There are other commercial NOSes that use virtual network devices, called Tun/Tap devices, to create switch ports inside the kernel that map to the switching silicon ports. The primary behavior of a Tun/Tap device is to boomerang the packet back from the kernel into user space. In such a case, the NOS vendor implements packet forwarding in its user space stack, as it does when running as a VM. The sequence of steps for the ping now works as shown in Figure 4-6:

  1. The application, ping, opens a socket to send a packet to 10.1.1.2.

  2. The kernel does the packet forwarding for this packet and sends it out port et2, as specified by the kernel routing table.

  3. et2 is a Tun/Tap device, so the packet is sent to the NOS vendor’s user space stack.

  4. The user space stack performs any vendor-specific processing necessary, and then uses the actual driver for the packet switching silicon to transmit the packet out the appropriate front panel port.

  5. When the ping reply is received, the switching silicon delivers the packet to the user space stack.

  6. After processing the packet, the user space stack writes the packet to the appropriate Tun/Tap device. For example, if the reply is received via the switching silicon port 3, the packet is written to et3, the Tun/Tap port corresponding to that port.

  7. The kernel receives this packet and processes it as if it came in on a regular network device such as a NIC.

  8. Ultimately, the kernel hands the packet to the ping process.

Figure 4-6. Tun/Tap Packet Tx/Rx Flow

This completes the loop of packet send and receive by a third-party application such as ping. Some NOS vendors who implement the hybrid model also use the Tun/Tap model when running on a real switch.
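For readers unfamiliar with Tun/Tap devices, the boomerang behavior boils down to the user space stack opening the device and reading the frames the kernel tries to transmit out of it (and writing received frames back in). Here is a minimal sketch of that mechanism; the device name et2 follows the example above, the constants come from the kernel’s if_tun.h header, and running it requires root privileges.

```python
import fcntl
import os
import struct

# Constants from <linux/if_tun.h>
TUNSETIFF = 0x400454ca
IFF_TAP = 0x0002      # TAP: Ethernet frames (TUN would carry IP packets)
IFF_NO_PI = 0x1000    # no extra packet-info header

# Create/attach to a TAP device named et2 (the name used in the example above).
fd = os.open("/dev/net/tun", os.O_RDWR)
fcntl.ioctl(fd, TUNSETIFF, struct.pack("16sH", b"et2", IFF_TAP | IFF_NO_PI))

# Anything the kernel routes out of et2 now lands here, in user space,
# where a vendor stack would process it and hand it to the silicon driver.
while True:
    frame = os.read(fd, 2048)
    print(f"kernel sent a {len(frame)}-byte frame out et2")
    # os.write(fd, reply_frame) would inject a frame back into the kernel.
```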

Running a Different Routing Protocol

If you want to run a different routing protocol suite from the one supplied by the vendor, it’s certainly possible in the kernel-only model. This suite needs to run on Linux and be able to program the state in the Linux kernel using netlink.
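As a sketch of what “program the state in the Linux kernel using netlink” looks like from a routing suite’s point of view, the snippet below installs a route with pyroute2. The library choice, the interface name, and the prefix and next hop are all illustrative assumptions on my part; any netlink library, or the rtnetlink socket API directly, works just as well. On a kernel-model NOS, the user space silicon driver would notice this route and offload it to the hardware.

```python
from pyroute2 import IPRoute   # assumes pyroute2 is installed; requires root to run

with IPRoute() as ipr:
    # Resolve the outgoing interface index (swp2, as in the earlier example).
    idx = ipr.link_lookup(ifname="swp2")[0]

    # Install a route exactly as a routing protocol suite would via netlink;
    # the prefix and next hop are illustrative only.
    ipr.route("add", dst="10.20.0.0/24", gateway="10.1.1.2", oif=idx)
```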

In the hybrid model, if the routing protocol suite needs to program state that is not synchronized with the kernel, it has to be modified to use the NOS vendor’s API. Even if this third-party routing suite wants to program only the kernel, it has to adapt to the peculiarities of the NOS vendor’s implementation of the switching silicon interfaces in the kernel.

What Else Is Left for a NOS to Do?

Besides programming the switching silicon, the NOS also has to read and control the various sensors on the box, such as power, fans, and the LEDs on the interfaces. The NOS also has to discover the box model, the switching silicon version, and so on. Besides all this, the NOS needs to know how to read and program the optics of the interfaces. None of this is standard across white-box and brite-box switches, so a NOS vendor has to write its own drivers and other patches to handle all of it. The NOS, of course, has lots of other things to provide, such as a user interface, possibly a programmatic API, and so on.

Summary

The primary purpose of this chapter was to educate a network architect or network engineer on cloud-native NOS choices. The reason this education is important is that these choices impose some fundamental consequences on the future of networking and the benefits they yield. The model chosen has effects on what operators can do: what applications they can run on the box, what information they can retrieve, and perhaps what network protocol features are supported.

The philosopher and second President of India, Sarvepalli Radhakrishnan, is quoted as saying, “All men are desirous of peace, but not all men are desirous of the things that lead to peace.” In a similar fashion, not every choice leads to the flourishing of a NOS that is true to the cloud-native spirit. So being thoughtful about the choices is very important, especially for network designers and architects. Network engineers, too, can cultivate their skill sets for each of the choices. The information in this chapter should help you determine which choices are easy today but possibly gone tomorrow, and which choices truly empower your organization. In the next chapter, we move up the software stack to examine routing protocols and how to select one that’s appropriate for your needs.
