Chapter 6. Network Performance Limits: Causes and Solutions
Networking is fundamental to distributed systems, which are composed of many machines communicating and working together. Networking is also often a bottleneck, so managing network performance is essential.
Network usage is much more like disk than CPU or memory in terms of its impact on multi-tenant distributed systems. Like disk usage, network usage by a given application tends to be extremely “bursty” and can be a bottleneck either before or after the bulk of a process’s useful computation is done. Network usage can also cause significant delays for latency-sensitive applications, even though those applications tend to be much lighter users of network (and disk) than latency-insensitive, data-heavy batch applications.
Bandwidth Problems in Distributed Systems
Whereas disk performance is limited by both total bandwidth and disk seeks (I/O operations per second [IOPS]), network performance is primarily limited by bandwidth alone. However, bandwidth in “the network” is not monolithic; depending on the network topology, bandwidth can be limited at several points, including the following (see Figure 6-1):
The network interface card (NIC) or cards on each node
The kernel’s ability to access the NIC to send and receive data
The switch each node is connected to—generally a top-of-rack (ToR) switch connecting 20–40 nodes together
The switch backplane connecting ports on the rack switch
The uplink from the rack ...