Getting Started with AI Infrastructure and GPU Computing

Published by O'Reilly Media, Inc.

Content level: Beginner

Understand GPU computing, distributed workloads, and efficiency

What you’ll learn and how you can apply it

  • Understand what GPU computing is and how GPUs have evolved
  • Gain a high-level understanding of how physical data center layers influence AI computing
  • Learn how different types of GPU networking increase performance in AI training
  • Explore the cloud native technologies that support AI workload scheduling and ops
  • Examine current complexities and problems of AI infrastructure

Course description

AI is changing computing rapidly. We now have to build software that understands how to leverage GPUs, and do so at scale. The GPUs available today are also drastically different from those we had access to even a few years ago, and they keep changing. Advancements in data center technology, networking, and GPU chips have all pushed AI software design forward, demanding new ways of thinking about and distributing large-scale AI workloads and training jobs.

This course explores the foundational concepts of AI infrastructure in an accessible way, offering a primer for engineers and infrastructure engineers. Bryan Oliver, who works on the AI Labs team at Thoughtworks, takes you through the evolution of GPU computing, explaining why it’s so important to developers, how the hardware is evolving and driving software forward, and how to take advantage of the advancements in this space. You’ll learn what a “GPU cloud” is and dive into the top concepts of a GPU data center; discover how the latest advancements are forcing the industry to rethink what a distributed workload is; and get hands-on with current industry projects to deploy, deliver, and schedule your AI workloads to take full advantage of these advancements.

This live event is for you because...

  • You’re an engineer or infrastructure engineer who needs to use or provide AI infrastructure.
  • You already provide AI infrastructure, and you want to improve your skills.

Prerequisites

  • Kubectl and Helm CLIs installed on your machine
  • An IDE with a YAML editor and formatter set up on your machine
  • Access to K8s ecosystem tooling (VS Code extension, K9s, Lens, or similar) recommended but not required
  • Experience deploying software with at least one cloud provider
  • An understanding of how cloud computing instances are managed and defined in the context of a hyperscaler
  • Familiarity with GPUs

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Why GPUs matter (30 minutes)

  • Presentation: How GPUs evolved from video games to AI; a primer on CUDA; running inference on a 70B parameter model and what happens when you submit
  • Group discussion: Understanding why the GPU is essential over the CPU in AI
  • Q&A
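
As a rough illustration of why a 70B-parameter model is a multi-GPU problem, the sketch below estimates the memory needed just to hold the weights. It is back-of-the-envelope arithmetic, not course material: the 80 GB per-GPU capacity is an assumed example value, the function names are hypothetical, and the estimate ignores the KV cache, activations, and framework overhead, all of which add to the real footprint.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: int,
                         gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


PARAMS = 70e9          # 70B parameters
GPU_MEMORY_GB = 80.0   # assumed capacity of a single GPU (example value)

for label, width in [("FP16 (2 bytes/param)", 2), ("INT8 (1 byte/param)", 1)]:
    size = weight_memory_gb(PARAMS, width)
    gpus = min_gpus_for_weights(PARAMS, width, GPU_MEMORY_GB)
    print(f"{label}: {size:.0f} GB of weights, at least {gpus} GPU(s)")
```

At 2 bytes per parameter the weights alone come to 140 GB, more than the assumed single GPU holds, so the model has to be split across devices and GPU-to-GPU communication becomes part of every request.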

What is a GPU cloud? (30 minutes)

  • Presentation: The layers of a GPU data center—from power and cooling to the rack topology
  • Hands-on exercise: Build a diagram of a GPU data center and identify where key components live
  • Group discussion: Introduction to the concept of variability—not all GPUs perform equally
  • Q&A

Why GPU-to-GPU speed matters (35 minutes)

  • Demonstration: The differences between GPUs connected via standard Ethernet/RoCE and GPUs connected via NVLink
  • Presentation: The three networking technologies—Ethernet (RoCE), InfiniBand, NVLink; RDMA—memory-to-memory transfer without CPU involvement; mapping answers back to the data center diagram from the last section; where these connections physically exist
  • Hands-on exercise: Work through two inference scenarios using a provided reference sheet with bandwidth numbers, GPU specs, and cost per hour
  • Q&A
  • Break
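
To make the effect of link speed concrete, the sketch below estimates how long it takes to move a fixed payload between GPUs over two kinds of link. The bandwidth figures are nominal peak numbers assumed for illustration (a 400 Gb/s network port is about 50 GB/s; 900 GB/s is the commonly quoted aggregate NVLink figure for one recent GPU generation). They are not the course's reference sheet, and real transfers see lower effective throughput plus latency that this model ignores.

```python
# Nominal peak bandwidths in GB/s. Assumed example values, not measurements.
LINKS_GB_PER_S = {
    "400 Gb/s Ethernet (RoCE) or InfiniBand port": 50.0,
    "NVLink (aggregate per GPU, one recent generation)": 900.0,
}


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time: payload divided by bandwidth, no latency or overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


PAYLOAD_GB = 140.0  # e.g. the FP16 weights of a 70B-parameter model

for link, bandwidth in LINKS_GB_PER_S.items():
    seconds = transfer_seconds(PAYLOAD_GB, bandwidth)
    print(f"{link}: {seconds:.2f} s to move {PAYLOAD_GB:.0f} GB")
```

Under these assumptions the same payload takes seconds over a single network port and a fraction of a second over NVLink, which is why where two GPUs sit relative to each other matters so much.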

Topology-aware scheduling (40 minutes)

  • Presentation: An example of poor GPU placement across rack boundaries; connecting the full chain—physical data center → topology labels → scheduler decisions → workload performance
  • Demonstration: Topology data from a real cluster—kubectl get nodes --show-labels or equivalent, topology labels (rack, zone, NVLink domains); how the scheduler uses this metadata
  • Q&A
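
The chain from topology labels to scheduler decisions can be sketched in a few lines. The example below is a toy model, not how any particular scheduler is implemented: the node data is made up, the `example.com/rack` label key is hypothetical (clusters choose their own rack-level labels), and the best-fit rule is just one simple placement policy.

```python
from collections import defaultdict

RACK_LABEL = "example.com/rack"  # hypothetical label key for this sketch


def free_gpus_by_rack(nodes):
    """Sum the free GPUs of all nodes that carry a rack label, per rack."""
    totals = defaultdict(int)
    for node in nodes:
        rack = node["labels"].get(RACK_LABEL)
        if rack is not None:
            totals[rack] += node["free_gpus"]
    return dict(totals)


def pick_single_rack(nodes, gpus_needed):
    """Best fit: the rack with the least free capacity that still holds the
    whole job, or None if the job would have to cross a rack boundary."""
    candidates = [(free, rack)
                  for rack, free in free_gpus_by_rack(nodes).items()
                  if free >= gpus_needed]
    return min(candidates)[1] if candidates else None


# Made-up cluster state, shaped like labels read from the Kubernetes API.
NODES = [
    {"name": "node-1", "labels": {RACK_LABEL: "rack-a"}, "free_gpus": 8},
    {"name": "node-2", "labels": {RACK_LABEL: "rack-a"}, "free_gpus": 4},
    {"name": "node-3", "labels": {RACK_LABEL: "rack-b"}, "free_gpus": 8},
    {"name": "node-4", "labels": {RACK_LABEL: "rack-b"}, "free_gpus": 8},
]

print(pick_single_rack(NODES, 12))  # fits in the smaller rack
print(pick_single_rack(NODES, 20))  # no single rack can hold it
```

A scheduler without this metadata sees four interchangeable nodes and may spread a 12-GPU job across both racks; with the label it can keep the job inside one.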

Deploying and delivering workloads to this infrastructure (35 minutes)

  • Presentation: From pipelines to topology-aware patterns, how to start thinking about small versus big deployments
  • Q&A
  • Break

Hands-on with Kueue (45 minutes)

  • Presentation: Brief introduction to Kueue—what it is, why it exists, how it extends Kubernetes scheduling for AI workloads
  • Hands-on exercise: Deploy Kueue to a Kubernetes cluster (provided sandbox or local environment); create topology CRDs representing data center structure; modify a job spec to target Kueue's topology constraints; submit the job and observe scheduling behavior
  • Group discussion: Review two or three solutions from attendees; troubleshoot common issues
  • Q&A
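
For a sense of what modifying a job spec for Kueue can look like, the sketch below builds a Kubernetes Job manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. It relies on two conventions from Kueue's documentation as understood at the time of writing: the `kueue.x-k8s.io/queue-name` label and the `kueue.x-k8s.io/podset-required-topology` pod template annotation. Check both against the Kueue version you deploy. The queue name, image, and `example.com/rack` topology label are hypothetical placeholders, and the helper function is not part of any library.

```python
import json


def kueue_job_manifest(name, queue, topology_label, image, gpus_per_pod, pods):
    """Build a batch/v1 Job that asks Kueue to place all pods within one
    topology domain (for example, one rack). Hypothetical helper."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Label that hands the job to a Kueue LocalQueue.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "suspend": True,  # created suspended; Kueue unsuspends on admission
            "parallelism": pods,
            "completions": pods,
            "template": {
                "metadata": {
                    # Annotation requesting one topology domain for the pod set.
                    "annotations": {
                        "kueue.x-k8s.io/podset-required-topology": topology_label
                    }
                },
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }],
                },
            },
        },
    }


manifest = kueue_job_manifest(
    name="demo-training-job",            # placeholder
    queue="team-a-queue",                # placeholder LocalQueue name
    topology_label="example.com/rack",   # hypothetical node label
    image="registry.example.com/trainer:latest",  # placeholder image
    gpus_per_pod=8,
    pods=2,
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest in code keeps the topology constraint in one place, so switching from rack-level to a different level is a one-argument change.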

Synthesis and wrap-up (25 minutes)

  • Presentation: Architectural overview of the OSS landscape; where Kueue fits alongside Volcano and YuniKorn, and when you might choose each; vLLM, NVIDIA Triton Inference Server, Text Generation Inference (TGI), and Ray Serve; how these projects optimize GPU utilization for serving workloads; resources and next steps; trends to watch (GPU virtualization, optical interconnects, evolving chip architectures)
  • Group discussion: How scheduling and serving projects complement each other in production architectures
  • Q&A

Your Instructor

  • Bryan Oliver

    Bryan is an engineer who designs and builds complex distributed systems. For the last three years, he’s been focused on platforms, GPU infrastructure, and cloud native at Thoughtworks. Currently, he concentrates on large-scale GPU infrastructure and scheduling techniques. Bryan also coauthored Effective Platform Engineering (Manning) and Designing Intelligent Delivery Systems (O’Reilly) and is a Thoughtworks Technology Radar coauthor and committee member. He speaks at conferences globally, occasionally sits on conference committees, and contributes to open source.

Skills covered

  • Artificial Intelligence (AI)
  • Computer Vision
  • Deep Learning
  • Chips & Processors