The evolution of data center networks

Five questions for Dinesh Dutt on the changing relationship between network and compute.

By Nikki McDonald and Dinesh Dutt

June 15, 2017

Server room (source: kewl)

I asked Dinesh Dutt, chief scientist at Cumulus Networks and author of BGP in the Datacenter, to discuss how data centers have changed in recent years, new tools and techniques for network engineers, and what the future may hold for data center networking. Here are some highlights.

How have data centers evolved over the past few years?

Modern data centers have come a long way from when I first began working on them in 2007. Pioneers such as Google and Amazon started a trend that many others now try to emulate.

First, more people are embracing the Clos network topology and routing as the fundamental rubric of the modern data center. This is true even of conservative enterprises. The biggest impediment to making this change is legacy applications that still assume bridging as the network interconnect, and that rely on broadcast and multicast for application discovery and heartbeat mechanisms. Modern data centers have either entirely banished multicast or relegated it to a rack-level service. And they charge more for these services on a larger scale.

Second, automation is now widely embraced as a mandatory requirement rather than a nice-to-have. While some are still unsure how to start, the desire for automation is clear, as evidenced by tools such as Ansible being enhanced to automate networking equipment.

Third, disaggregated solutions, or white box switching, where the hardware and software are acquired separately from different vendors, are steadily gaining ground. Traditional vendors are starting to dip their toes into this pool, as server vendors did in the late '90s world of PCs and x86.

Along with disaggregated switches, people are starting to understand the logic, benefits, and simplicity of fixed form-factor switches instead of large modular chassis.

From an applications perspective, I see the rise of microservices and containers as a fast-growing trend. People are pushing features such as security into the application instead of leaving them to the network. Needless to say, some basic traffic isolation is still provided by networks, but not to the extent that older solutions prescribed.

What are some of the traditional approaches to troubleshooting networks?

Traditionally, network operators troubleshoot with tools such as ping and traceroute, opening multiple terminal windows, each connected to a different node, to diagnose each box. In other words, they try to build a fabric-wide picture using a box-by-box approach. This approach is still common in many modern data centers because of the dearth of tools in this area. Networking is still largely an appliance-based model rather than a platform model, which prevents anybody but the vendor from developing the appropriate tools. Vendors are not operators and often lack the perspective of network operators.

A network operator’s life is not easy, and trends such as SDN have not really helped. While we have started automating configuration management, we have not gone much beyond that. Don’t get me wrong; the largest data centers have modernized their network operations to extend automation beyond configuration, but not many others.

When things go wrong, the network operator is typically manually running a checklist to troubleshoot the problem. We need to codify and automate network operations beyond just configuration.
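As a rough illustration of what codifying a checklist might look like, the Python sketch below runs a list of named checks in order and reports the first one that fails. Everything in it is an assumption made for illustration: the helper names `ping_check`, `run_checklist`, and `first_failure` are hypothetical, and the ping flags assume the Linux iputils version of ping.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# A check is any zero-argument callable returning (ok, detail).
Check = Callable[[], Tuple[bool, str]]


def ping_check(host: str) -> Check:
    """Build a check that sends one ping to `host` (Linux iputils flags assumed)."""
    def check() -> Tuple[bool, str]:
        try:
            done = subprocess.run(
                ["ping", "-c", "1", "-W", "1", host],
                capture_output=True, text=True, timeout=5,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            return False, f"ping could not complete: {exc}"
        return done.returncode == 0, f"ping exit code {done.returncode}"
    return check


def run_checklist(checks: List[Tuple[str, Check]]) -> List[Tuple[str, bool, str]]:
    """Run every check in order and collect (name, ok, detail) results."""
    results = []
    for name, check in checks:
        ok, detail = check()
        results.append((name, ok, detail))
    return results


def first_failure(results: List[Tuple[str, bool, str]]) -> Optional[str]:
    """Return the name of the first failed check, or None if all passed."""
    for name, ok, _detail in results:
        if not ok:
            return name
    return None
```

A checklist could then be an ordinary list, for example `[("leaf01 reachable", ping_check("leaf01")), ("spine01 reachable", ping_check("spine01"))]`, where the host names are placeholders. Keeping each check a plain callable means the same runner can later hold checks for routing sessions or interface state.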

What are some of the new tools and techniques that network engineers should be aware of?

Ping and traceroute have started supporting multipathing, which is the availability of multiple paths to reach a destination and is a characteristic of modern data centers. There are also newer tools such as scamper that handle the presentation of information somewhat better than ping and traceroute. And large-scale network operators use tools such as Pingmesh and Norad to quickly troubleshoot problems.
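To make multipathing concrete, here is a minimal Python sketch of how a switch might spread flows across equal-cost next hops, and why a probing tool has to vary the flow identifiers to see more than one path. It is an illustration under assumptions, not how any particular switch or tool works: real hardware uses vendor-specific hash functions, `zlib.crc32` is only a stand-in, and the names `pick_next_hop` and `paths_seen` are hypothetical.

```python
import zlib
from typing import Iterable, Sequence, Set


def pick_next_hop(src_ip: str, dst_ip: str, proto: int,
                  src_port: int, dst_port: int,
                  next_hops: Sequence[str]) -> str:
    """Map a flow's 5-tuple onto one of several equal-cost next hops.

    crc32 is a stand-in for the (vendor-specific) hardware hash.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


def paths_seen(src_ip: str, dst_ip: str, dst_port: int,
               next_hops: Sequence[str],
               src_ports: Iterable[int]) -> Set[str]:
    """Next hops reached when only the UDP source port is varied."""
    return {
        pick_next_hop(src_ip, dst_ip, 17, port, dst_port, next_hops)
        for port in src_ports
    }
```

Because the same 5-tuple always hashes to the same next hop, a probe that keeps its ports fixed only ever exercises one path. Varying the source port, as `paths_seen` does, is one simple way to reach the others.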

But it is the rise of open networking that has the greatest potential. I say this because system administrators have many tools to help identify problems, whereas network administrators have tools limited by SNMP. With Linux powering routers and switches, networking devices now offer an open platform that allows people to treat routers and switches much like servers that have many network interface cards. The accessibility of this type of platform empowers network administrators to develop their own breed of tools. Even simple metrics gathering is becoming possible without SNMP. But this area is immature and still developing. I have been working on a tool to help validate and troubleshoot networks that I’ll cover in the talk I’m giving at Velocity.
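As one example of metrics gathering without SNMP, a switch running Linux exposes per-interface counters in `/proc/net/dev`, just as a server does. The Python sketch below parses that file's text into a dictionary. It is a minimal sketch assuming the standard 16-column layout of the file; the function names `parse_proc_net_dev` and `interfaces_with_errors` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Column order as printed in the header of /proc/net/dev.
RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")


def parse_proc_net_dev(text: str) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Parse /proc/net/dev text into {interface: {"rx": {...}, "tx": {...}}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, values = line.rpartition(":")
        if not sep:
            continue
        try:
            numbers = [int(v) for v in values.split()]
        except ValueError:
            continue
        if len(numbers) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # skip lines that do not match the assumed layout
        counters[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, numbers[:8])),
            "tx": dict(zip(TX_FIELDS, numbers[8:])),
        }
    return counters


def interfaces_with_errors(counters) -> List[str]:
    """Names of interfaces with any receive or transmit errors or drops."""
    return sorted(
        name for name, c in counters.items()
        if c["rx"]["errs"] or c["rx"]["drop"] or c["tx"]["errs"] or c["tx"]["drop"]
    )


if __name__ == "__main__":
    proc_file = Path("/proc/net/dev")
    if proc_file.exists():  # only present on Linux
        print(interfaces_with_errors(parse_proc_net_dev(proc_file.read_text())))
```

Run on a Linux switch, this prints the interfaces that have logged errors or drops, which is the kind of simple check that previously required an SNMP poller.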

What do you see as the future of data center networking?

In a single word: consilience. I see a world where network and compute (and storage) are managed similarly, with each leveraging the tools and expertise of the other domains. With an open platform powering each, we no longer have to solve the same problem twice (once for networking, and once for compute). Often, the two solutions are vastly different for no reason other than history. Computing was opened up when Linux replaced the proprietary server OSes; this changed the application landscape as we know it. I think solutions such as clouds and Google search would have been harder to invent if not for the widespread use of Linux. For too long people have innovated around networks, not with them. If networking can be opened up as computing has been, and if people can innovate with the network rather than around it, I think networking and computing as we know them can change dramatically.

You’re planning to be at the O’Reilly Velocity Conference in New York. What presentations are you looking forward to attending while there?

Unfortunately, I am only in New York on Tuesday, so the sessions I can attend will be limited, but I am looking forward to these three: "Load Balancing, Consistent Hashing, and Locality" by Andrew Rodland; "Scalable, Fluent Time Series Data Analysis" by Leif Walsh; and "Persistent SRE Anti-Patterns" by Blake Bisset and Jonah Horowitz.
