Chapter 6. Network Performance Limits: Causes and Solutions

Introduction

Networking is fundamental to distributed systems, which are composed of many machines communicating and working together. Networking is also often a bottleneck, so managing network performance is essential.

Network usage is much more like disk than CPU or memory in terms of its impact on multi-tenant distributed systems. Like disk usage, network usage by a given application tends to be extremely “bursty” and can be a bottleneck either before or after the bulk of a process’s useful computation is done. Network usage can also cause significant delays for latency-sensitive applications, even though those applications tend to be much lighter users of network (and disk) than latency-insensitive, data-heavy batch applications.

Bandwidth Problems in Distributed Systems

Whereas disk performance is limited by both total bandwidth and disk seeks (I/O operations per second [IOPS]), network performance is primarily limited by bandwidth alone. However, bandwidth in “the network” is not monolithic; depending on the network topology, bandwidth can be limited at several points, including the following (see Figure 6-1):

  • The network interface card (NIC) or cards on each node

  • The kernel’s ability to access the NIC to send and receive data

  • The switch each node is connected to—generally a top-of-rack (ToR) switch connecting 20–40 nodes together

  • The switch backplane connecting ports on the rack switch

  • The uplink from the rack ...

Get Effective Multi-Tenant Distributed Systems now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.