
AI Data Center Network Design and Technologies

Name: AI Data Center Network Design and Technologies
ISBN: 9780135436370

by Mahesh Subramaniam, Michal Styszynski, Himanshu Tambakuwala

February 2026

Intermediate to advanced

384 pages

12h 48m

English

Addison-Wesley Professional


Cover Page
About This eBook
Halftitle Page
Title Page
Copyright Page
Dedication Page
Contents
Foreword
Preface
Acknowledgments

About the Authors
What You’ll Find Inside
Credits
1. Wonders in the Workload
What’s New in AI Data Center Workloads; The Life Cycle of an AI Model; Training an AI Model; Parallelism; Job Completion Time (JCT); Tail Latency; Summary; Test Your Knowledge
2. “The Common-Man View” of AI Data Center Fabrics
Training vs. Inference AI Data Centers; InfiniBand vs. Ethernet for AI Training Data Centers; Ethernet Hardware Switches and Advanced Software Features; Handling Elephant Flows; Load-Balancing Techniques; Congestion Management and Mitigation Techniques; Summary; Test Your Knowledge
3. Network Design Considerations
Background Introduction; Training Data Center Architecture; Rail-Optimized Design (ROD); Rail-Unified Design (RUD); Rack Design; Scheduled Fabric; Topologies; Inference Data Center Architecture; Multi-Planar Scale-Out Architectures; Summary; Test Your Knowledge; References
4. Optics and Cable Management
Scaling Optics for AI Clusters; Challenges in Optical Innovation; Packet Flow; Transmission Modes; Transceiver Types; Cable and Connector Types; Standards; Further Innovations in Optics; Summary; Test Your Knowledge; References
5. Thermal and Power Efficiency Considerations
Thermal Footprints in AI Data Centers; Airflow Options; Liquid Cooling; Summary; Test Your Knowledge; References
6. Efficient Load Balancing
Per-Flow Load Balancing; Per-Packet Load Balancing; Load-Balancing Mechanism Comparison; Summary; Test Your Knowledge
7. RoCEv2 Transport and Congestion Management
Congestion Points; Explicit Congestion Notification (ECN); Data Center Quantized Congestion Notification (DCQCN); Source Flow Control (SFC); Congestion Signaling; Summary; Test Your Knowledge
8. IP Routing for AI/ML Fabrics
Dynamic IP Routing Options; eBGP Underlay for Three-Stage/Five-Stage Fabric for an AI Data Center; Multi-tenancy for an AI/ML Cluster Data Center Network; Microsegmentation and Multi-tenancy for an AI/ML Data Center; Extending IP Routing to the Server; Traffic Engineering in the AI Data Center Fabric; Segment Routing and SRv6 for AI/ML Fabrics; Summary; Test Your Knowledge; References
9. Storage Network Design and Technologies
The AI Data Center Life Cycle and Storage Networks; Storage Network Design Types; Block, Object, and File Storage Systems; NVMe-oF for Block-Level Access; NVMe-o-RDMA/RoCEv2 State Machine; High-Performance File Systems; GPUDirect Storage; Summary; Test Your Knowledge; References
10. AI Network Performance KPIs
Significance of Performance Benchmarking; MLCommons for AI Data Centers; MLCommons Initiatives; MLCommons Benchmarking Suites; Benchmarking a Data Center for Machine Learning; Summary; Test Your Knowledge; References
11. Monitoring and Telemetry
Exploring Monitoring Options; Network Monitoring in an AI/ML Data Center Network; In-Band Flow Analyzer (IFA); Corrective Actions; Summary; Reference
12. Ultra Ethernet Consortium (UEC)
UEC Developments and Working Groups; UEC Key Terminology; The UEC and Network Architectures; A New Protocol Stack; Data Plan: Packet Forwarding Options; Packet Delivery Modes; Congestion Management (CM) in the UEC Specification; Packet Trimming and Fast Retransmissions; Link Layer Reliability (LLR) Mechanism; In-Network Collectives (INC) and xCCL; Management and Orchestration; Interoperability and Backward Compatibility; Compliance and Certification; UEC Challenges and Future Directions; Comparing UEC to InfiniBand and RoCEv2; Summary; Test Your Knowledge; References
13. Scale-Up Systems
Key Building Blocks of Scale-Up Systems; Scale-Up Ethernet Transport (SUE-T); Ultra Accelerator Link (UALink); Memory Coherence in Scale-Up Systems; Scale-Up Systems: Key Differences and Similarities; Summary; Test Your Knowledge; References
14. Conclusion
DC Network Role for AI; Caveats and Challenges; Future Developments; Final Remarks; References
Appendix A. Questions and Answers
Chapter 1: Wonders in the Workload; Chapter 2: “The Common-Man View” of AI Data Center Fabrics; Chapter 3: Network Design Considerations; Chapter 4: Optics and Cable Management; Chapter 5: Thermal and Power Efficiency Considerations; Chapter 6: Efficient Load Balancing; Chapter 7: RoCEv2 Transport and Congestion Management; Chapter 8: IP Routing for AI/ML Fabrics; Chapter 9: Storage Network Design and Technologies; Chapter 10: AI Network Performance KPIs; Chapter 11: Monitoring and Telemetry; Chapter 12: Ultra Ethernet Consortium (UEC)
Appendix B. Acronyms
Index

Content preview from AI Data Center Network Design and Technologies

6 Efficient Load Balancing

As we discussed in Chapter 1, “Wonders in the Workload,” AI/ML clusters use different kinds of models, such as large language models (LLMs) with both natural language understanding (NLU) and natural language generation (NLG) components. These models are trained in parallel across multiple GPUs. During training, the GPUs need to synchronize massive amounts of data across nodes, which leads to enormous east–west traffic in the data center. Typically, the number of applications generating traffic in a cluster is low, most of the traffic is RDMA over Converged Ethernet version 2 (RoCEv2) traffic, and there is low entropy at the transport layer. Also, in many cases, the traffic in an AI/ML fabric is between source/destination ...
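To make the low-entropy point concrete, the following is a minimal Python sketch of per-flow (hash-based) load balancing, not the book's own example. It assumes a CRC32 hash over the 5-tuple as a stand-in for a switch's vendor-specific hash function, and the IP addresses, flow counts, and helper names (`ecmp_link`, `link_load`, `rocev2_flows`) are all hypothetical. The only protocol facts it relies on are that RoCEv2 runs over UDP (IP protocol 17) with destination port 4791.

```python
import zlib
from collections import Counter

ROCEV2_UDP_PORT = 4791  # UDP destination port used by RoCEv2
UDP_PROTOCOL = 17       # IP protocol number for UDP


def ecmp_link(flow, num_links):
    """Pick an uplink by hashing the flow's 5-tuple.

    CRC32 is an assumption for illustration; real switches use
    vendor-specific hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links


def link_load(flows, num_links):
    """Return the number of flows that land on each uplink."""
    counts = Counter(ecmp_link(flow, num_links) for flow in flows)
    return [counts[link] for link in range(num_links)]


def rocev2_flows(num_pairs, src_ports):
    """Build hypothetical RoCEv2 5-tuples for num_pairs GPU node pairs."""
    flows = []
    for pair in range(num_pairs):
        src_ip = f"10.0.0.{pair + 1}"
        dst_ip = f"10.0.1.{pair + 1}"
        for src_port in src_ports:
            flows.append(
                (src_ip, dst_ip, UDP_PROTOCOL, src_port, ROCEV2_UDP_PORT)
            )
    return flows


if __name__ == "__main__":
    num_links = 8
    # Low entropy: 8 large flows, one per node pair, same source port.
    few = rocev2_flows(num_pairs=8, src_ports=[49152])
    # Higher entropy: the same pairs, but 128 source ports each.
    many = rocev2_flows(num_pairs=8, src_ports=range(49152, 49152 + 128))
    print("8 flows:   ", link_load(few, num_links))
    print("1024 flows:", link_load(many, num_links))
```

Because every packet of a flow hashes to the same uplink, a handful of large flows can leave some links idle while others carry several flows, whereas a larger number of distinct 5-tuples tends to spread more evenly. The exact distribution depends on the hash function and the addresses chosen.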
