AI Data Center Network Design and Technologies
by Mahesh Subramaniam, Michal Styszynski, Himanshu Tambakuwala
Chapter 6. Efficient Load Balancing
As we discussed in Chapter 1, “Wonders in the Workload,” AI/ML clusters run different kinds of models, such as large language models (LLMs) with both natural language understanding (NLU) and natural language generation (NLG) components. These models are trained in parallel across multiple GPUs. During training, the GPUs must synchronize massive amounts of data across nodes, which produces enormous east–west traffic in the data center. Typically, only a few applications generate traffic in a cluster, most of that traffic is RDMA over Converged Ethernet (RoCEv2), and there is low entropy at the transport layer. Also, in many cases, the traffic in an AI/ML fabric is between source/destination ...
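The low-entropy point can be illustrated with a minimal sketch, assuming a simple hash-based scheme that picks one uplink per flow from its 5-tuple. The hash function (SHA-256 truncated to 32 bits), the IP addresses, the source-port range, and the helper names `ecmp_link` and `link_usage` are all hypothetical and chosen only for illustration; real switches use their own hash functions and field selections. The one grounded value is UDP destination port 4791, which RoCEv2 uses.

```python
import hashlib
from collections import Counter

ROCEV2_UDP_PORT = 4791  # UDP destination port used by RoCEv2
UDP_PROTOCOL = 17       # IP protocol number for UDP


def ecmp_link(flow, num_links):
    """Map a 5-tuple to an uplink index (illustrative stand-in for a switch hash)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def link_usage(flows, num_links):
    """Return the number of flows placed on each uplink."""
    counts = Counter(ecmp_link(flow, num_links) for flow in flows)
    return [counts.get(link, 0) for link in range(num_links)]


# Hypothetical workload: one GPU pair, four flows that differ only in source port.
flows = [
    ("10.0.0.1", "10.0.1.1", UDP_PROTOCOL, 49152 + i, ROCEV2_UDP_PORT)
    for i in range(4)
]
usage = link_usage(flows, num_links=8)

print("flows per uplink:", usage)
print("idle uplinks:", usage.count(0))
```

With only four distinct 5-tuples spread over eight uplinks, at least four uplinks carry no traffic whatever the hash does, and any collision leaves more idle while one link carries several large flows. That is the sense in which a small number of long-lived flows gives a per-flow hash too little entropy to balance load.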