Getting Started with AI Infrastructure and GPU Computing

Beginner

Understand GPU computing, distributed workloads, and efficiency

What you’ll learn and how you can apply it

Understand what GPU computing is and how GPUs have evolved
Gain a high-level understanding of how physical data center layers influence AI computing
Learn how different types of GPU networking increase performance in AI training
Explore the cloud native technologies that support AI workload scheduling and ops
Examine current complexities and problems of AI infrastructure

Course description

AI is changing computing rapidly. We now have to build software that knows how to use GPUs, and to do so at scale. The GPUs on the market today are also drastically different from those available even a few years ago, and they keep changing. Advancements in data center technology, networking, and GPU chips have all pushed AI software design forward, demanding new ways of thinking about and distributing large-scale AI workloads and training jobs.

This course explores the foundational concepts of AI infrastructure in an accessible way, offering a primer for engineers and infrastructure engineers. Bryan Oliver, who works on the AI Labs team at Thoughtworks, takes you through the evolution of GPU computing, explaining why it’s so important to developers, how the hardware is evolving and driving software forward, and how to take advantage of the advancements in this space. You’ll learn what a “GPU cloud” is and dive into the top concepts of a GPU data center; discover how the latest advancements are forcing the industry to rethink what a distributed workload is; and get hands-on with current industry projects to deploy, deliver, and schedule your AI workloads to take full advantage of these advancements.

This live event is for you because...

You’re an engineer or infrastructure engineer who needs to use or provide AI infrastructure.
You already provide AI infrastructure, and you want to improve your skills.

Prerequisites

kubectl and Helm CLIs installed on your machine
An IDE with a YAML editor and formatter set up on your machine
Access to K8s ecosystem tooling (VS Code extension, K9s, Lens, or similar) recommended but not required
Experience deploying software with at least one cloud provider
An understanding of how cloud computing instances are managed and defined in the context of a hyperscaler
Familiarity with GPUs

Recommended follow-up:

Read AI Systems Performance Engineering (book)
Read Deep Learning at Scale (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Why GPUs matter (30 minutes)

Presentation: How GPUs evolved from video games to AI; a primer on CUDA; running inference on a 70B parameter model and what happens when you submit
Group discussion: Understanding why AI workloads depend on the GPU rather than the CPU
Q&A
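
As a rough illustration of why a 70B parameter model is a useful running example, the sketch below estimates the memory needed just to hold the weights. The 2 bytes per parameter (FP16/BF16) and the 80 GB GPU memory size are assumptions chosen for the example, not figures from the course, and the estimate ignores the KV cache, activations, and framework overhead.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: int,
                         gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory fits the weights.

    This is a lower bound: real deployments also need room for the
    KV cache, activations, and runtime overhead.
    """
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    params = 70e9          # 70B parameters
    bytes_per_param = 2    # assumption: FP16/BF16 weights
    gpu_memory_gb = 80     # assumption: an 80 GB accelerator

    print("weights (GB):", weight_memory_gb(params, bytes_per_param))
    print("GPUs needed (lower bound):",
          min_gpus_for_weights(params, bytes_per_param, gpu_memory_gb))
```

Under these assumptions the weights alone do not fit on one device, which is why GPU-to-GPU communication becomes part of the inference path.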

What is a GPU cloud? (30 minutes)

Presentation: The layers of a GPU data center—from power and cooling to the rack topology
Hands-on exercise: Build a diagram of a GPU data center and identify where key components live
Group discussion: Introduction to the concept of variability—not all GPUs perform equally
Q&A

Why GPU-to-GPU speed matters (35 minutes)

Demonstration: The differences between GPUs connected via standard Ethernet/RoCE and GPUs connected via NVLink
Presentation: The three networking technologies (Ethernet with RoCE, InfiniBand, and NVLink); RDMA, which transfers data from memory to memory without CPU involvement; mapping answers back to the data center diagram from the last section; where these connections physically exist
Hands-on exercise: Work through two inference scenarios using a provided reference sheet with bandwidth numbers, GPU specs, and cost per hour
Q&A
Break
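
The kind of arithmetic an exercise like this involves can be sketched as follows. The function only divides payload size by link bandwidth, so it ignores latency, protocol overhead, and contention. The two bandwidth values are nominal peak figures assumed for illustration (a 400 Gb/s link is 50 GB/s at line rate, and 900 GB/s is a commonly quoted per-GPU NVLink aggregate); they are not the numbers from the course reference sheet.

```python
def transfer_seconds(payload_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to move a payload over a link.

    bandwidth_gb_per_s is in GB/s (10^9 bytes per second). Latency and
    protocol overhead are ignored, so this is a best-case estimate.
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    payload = 140e9  # assumption: 140 GB, e.g. FP16 weights of a 70B model

    # Assumed nominal peak bandwidths, for illustration only.
    links = {
        "400 Gb/s Ethernet/RoCE (50 GB/s)": 50.0,
        "NVLink (900 GB/s aggregate)": 900.0,
    }
    for name, bandwidth in links.items():
        print(f"{name}: {transfer_seconds(payload, bandwidth):.3f} s")
```

Swapping in the bandwidth and payload values from a real reference sheet gives a first-order comparison between interconnects before cost per hour is considered.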

Topology-aware scheduling (40 minutes)

Presentation: An example of poor GPU placement across rack boundaries; connecting the full chain—physical data center → topology labels → scheduler decisions → workload performance
Demonstration: Topology data from a real cluster—kubectl get nodes --show-labels or equivalent, topology labels (rack, zone, NVLink domains); how the scheduler uses this metadata
Q&A
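
To make the chain from topology labels to placement quality concrete, here is a minimal sketch that groups nodes by a rack label and checks whether a proposed placement crosses a rack boundary. The label key `example.com/rack` and the node names are hypothetical; a real cluster exposes its own keys, which `kubectl get nodes --show-labels` will list. This is not how any particular scheduler is implemented.

```python
from collections import defaultdict

# Hypothetical label key; real clusters define their own topology labels.
RACK_LABEL = "example.com/rack"


def group_nodes_by_label(node_labels: dict, key: str) -> dict:
    """Map each label value (e.g. a rack name) to the sorted nodes carrying it."""
    groups = defaultdict(list)
    for node, labels in node_labels.items():
        groups[labels.get(key, "unknown")].append(node)
    return {value: sorted(nodes) for value, nodes in groups.items()}


def crosses_boundary(placement: list, node_labels: dict, key: str) -> bool:
    """True if the nodes in a placement span more than one topology domain."""
    domains = {node_labels[node].get(key, "unknown") for node in placement}
    return len(domains) > 1


if __name__ == "__main__":
    # Hypothetical nodes and labels, shaped like parsed node metadata.
    nodes = {
        "gpu-node-a": {RACK_LABEL: "rack-1"},
        "gpu-node-b": {RACK_LABEL: "rack-1"},
        "gpu-node-c": {RACK_LABEL: "rack-2"},
    }
    print(group_nodes_by_label(nodes, RACK_LABEL))
    print(crosses_boundary(["gpu-node-a", "gpu-node-b"], nodes, RACK_LABEL))
    print(crosses_boundary(["gpu-node-a", "gpu-node-c"], nodes, RACK_LABEL))
```

A topology-aware scheduler makes a comparable decision from the same metadata, preferring placements that stay inside one domain when the workload is communication heavy.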

Deploying and delivering workloads to this infrastructure (35 minutes)

Presentation: From pipelines to topology-aware patterns, how to start thinking about small versus big deployments
Q&A
Break

Hands-on with Kueue (45 minutes)

Presentation: Brief introduction to Kueue—what it is, why it exists, how it extends Kubernetes scheduling for AI workloads
Hands-on exercise: Deploy Kueue to a Kubernetes cluster (provided sandbox or local environment); create topology CRDs representing data center structure; modify a job spec to target Kueue's topology constraints; submit the job and observe scheduling behavior
Group discussion: Review two or three solutions from attendees; troubleshoot common issues
Q&A
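
The "modify a job spec" step might look something like the sketch below, which patches a Job manifest held as a Python dictionary. It assumes the `kueue.x-k8s.io/queue-name` label and the `kueue.x-k8s.io/podset-required-topology` annotation described in the Kueue documentation; check both against the Kueue version you deploy. The queue name, topology level label, and job contents are hypothetical, and this is not the course's own exercise solution.

```python
import copy

# Names as documented by Kueue at the time of writing; verify for your version.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"
REQUIRED_TOPOLOGY_ANNOTATION = "kueue.x-k8s.io/podset-required-topology"


def target_kueue(job: dict, queue_name: str, topology_level: str) -> dict:
    """Return a copy of a Job manifest pointed at a Kueue queue and topology level."""
    patched = copy.deepcopy(job)

    # The queue label tells Kueue which LocalQueue should admit the job.
    patched.setdefault("metadata", {}).setdefault("labels", {})[QUEUE_LABEL] = queue_name

    spec = patched.setdefault("spec", {})
    # Jobs managed by Kueue start suspended; Kueue unsuspends them on admission.
    spec["suspend"] = True

    # The annotation goes on the pod template and names the topology level
    # (a node label key) within which all pods must be placed.
    template_meta = spec.setdefault("template", {}).setdefault("metadata", {})
    template_meta.setdefault("annotations", {})[REQUIRED_TOPOLOGY_ANNOTATION] = topology_level
    return patched


if __name__ == "__main__":
    # Hypothetical minimal Job manifest.
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "demo-training-job"},
        "spec": {"template": {"spec": {"restartPolicy": "Never", "containers": []}}},
    }
    # Hypothetical queue name and topology level label.
    print(target_kueue(job, "demo-queue", "example.com/rack"))
```

Returning a patched copy rather than mutating the input keeps the original manifest available for comparison when observing how scheduling behavior changes.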

Synthesis and wrap-up (25 minutes)

Presentation: Architectural overview of the OSS landscape; where Kueue fits alongside Volcano and YuniKorn, and when you might choose each; vLLM, NVIDIA Triton Inference Server, Text Generation Inference (TGI), and Ray Serve; how these projects optimize GPU utilization for serving workloads; resources and next steps; trends to watch (GPU virtualization, optical interconnects, evolving chip architectures)
Group discussion: How scheduling and serving projects complement each other in production architectures
Q&A

Your Instructor

Bryan Oliver
Bryan is an engineer who designs and builds complex distributed systems. For the last three years, he’s been focused on platforms, GPU infrastructure, and cloud native at Thoughtworks. Currently, he concentrates on large-scale GPU infrastructure and scheduling techniques. Bryan also coauthored Effective Platform Engineering (Manning) and Designing Intelligent Delivery Systems (O’Reilly) and is a Thoughtworks Technology Radar coauthor and committee member. He speaks at conferences globally, occasionally sits on conference committees, and contributes to open source.

Skills covered

Artificial Intelligence (AI)

Computer Vision
Deep Learning
Chips & Processors
