Infrastructure & Ops Superstream: Infrastructure for AI

Published by O'Reilly Media, Inc.

Content level: Beginner to advanced

GPUs, neoclouds, and next gen data centers

Modern AI and machine learning infrastructure presents unique orchestration challenges that extend far beyond traditional workloads, especially as organizations shift to training and running inference on thousands of GPUs simultaneously. This event addresses the practical realities of managing specialized compute resources, along with the platforms and tools designed to tame their complexity. Join our panel of experts as they explore the shift from traditional computing to modern AI infrastructure, detailing why factors like data center topology and rack-to-rack latency are now more important than ever.

Sessions will provide hands-on guidance for building multicloud AI platforms by unifying different computing environments into a single abstraction. You’ll learn how to secure GPU capacity, reduce costs, and eliminate vendor lock-in while maintaining ML engineer productivity. We’ll also cover strategies for building AI-ready data foundations, exploring cloud native storage architectures and hybrid cloud agility to meet unprecedented demands for scale, performance, and resilience at the enterprise level. This event will help DevOps and infrastructure engineers effectively support their organization’s AI initiatives.

What you’ll learn and how you can apply it

  • Understand the basics of specialized AI infrastructure and the challenges of massive-scale GPU orchestration, recognizing the critical importance of factors like data center topology and rack-to-rack latency
  • Develop strategies for building unified, cost-effective multicloud AI platforms that secure GPU capacity and eliminate vendor lock-in
  • Learn how to construct AI-ready data foundations and cloud native storage architectures to meet enterprise-level demands for scale, performance, and resilience

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Sam Newman: Introduction (5 minutes)

Sam Newman welcomes you to the Infrastructure & Ops Superstream.

An Introduction to AI Infrastructure – Bryan Oliver (35 minutes)

AI has drastically changed computing infrastructure, forcing large-scale innovation in the GPU computing sector. We no longer train or run inference on just one GPU; instead, we use thousands of GPUs simultaneously for single models. This type of infrastructure is very complex, and we must think differently about how we write and deploy software to it. Factors such as data center topology and even rack-to-rack latency are now more important than ever. Bryan Oliver, author and Thoughtworks engineer, gives you an overview of AI infrastructure along with entry-level breakdowns of its inner workings. You’ll walk away with a greater respect for the scale and complexity of this exploding technology and a general understanding of how you can start to think about using it.

Infrastructure for AI: Managing GPU Workloads with Kubernetes – Saiyam Pathak (35 minutes)

The AI revolution has made GPUs the most valuable resource in modern infrastructure and the hardest to manage at scale. Kubernetes is becoming the platform of choice for AI workloads, but production-grade GenAI comes with real challenges: GPU scarcity, poor utilization, multiteam access, and fragmented deployments across cloud and bare metal. Saiyam Pathak, head of developer relations at vCluster, explores how to build AI-ready Kubernetes platforms that efficiently run today’s LLM-powered workloads. The session covers GPU sharing (time-slicing, MIG), advanced scheduling with KAI Scheduler, and secure multitenancy using vCluster. You’ll also learn how open source tools like vLLM fit into real architectures for scalable inference. You’ll come away with a practical blueprint for delivering high-performance, cost-efficient, multitenant AI infrastructure across Kubernetes environments.

Break (5 minutes)

Distributed Data Architectures for AI: A Deep Dive into Real-World Use Cases – Rob Reid (Sponsored by Cockroach Labs) (30 minutes)

Modern AI use cases such as recommendations, semantic search, RAG, agentic AI, and unified queries place heavy demands on the underlying database. Author and Cockroach Labs’ technical evangelist Rob Reid explores what each use case needs in terms of consistency, performance, scale, and resilience—and what happens when traditional systems are pushed beyond their limits. You’ll learn how distributed SQL architecture provides the most practical path to support AI at enterprise scale, with the ability to eliminate the complex data pipelines that emerge when data is spread across multiple systems. This session will be followed by a 30-minute Q&A in a breakout room. Stop by if you have more questions for Rob.

Protecting and Connecting High-Value AI Workloads on Kubernetes with Cilium – Nico Vibert (35 minutes)

ML models are precious assets and a central part of an organization’s intellectual capital, yet they now face a growing range of attacks, including prompt injection, model poisoning, and data exfiltration through misconfigured services. At the same time, AI and ML workloads demand high bandwidth, low latency, and predictable connectivity, and most modern AI platforms run on Kubernetes to meet these scaling needs. Nico Vibert, technical marketing engineering director at Isovalent and coauthor of Cilium: Up and Running, shows how Cilium’s eBPF-powered networking and security stack protects model serving pipelines while delivering the performance AI workloads require. You’ll learn how to secure interservice communication, enforce workload-aware policy, observe inference traffic, and build reliable high-throughput clusters for sensitive ML applications. This session gives platform teams the tools to safeguard their models and keep AI systems resilient at scale.

Break (5 minutes)

Building Multicloud AI Platforms Without the Pain – Romil Bhardwaj (35 minutes)

GenAI workloads are redefining how AI platforms are built. Teams can no longer rely on a single cloud to satisfy their GPU needs, infra costs are growing, and the productivity of ML engineers is paramount. Going multicloud secures GPU capacity, reduces costs, and eliminates vendor lock-in, but it also introduces operational complexity that can slow down ML teams. Romil Bhardwaj, cocreator of the SkyPilot open source project, provides a guide to building a multicloud AI platform that unifies cloud VMs and Kubernetes clusters across hyperscalers (AWS, GCP, and Azure), neoclouds (CoreWeave, Nebius, Lambda), and on-premises clusters into a single compute abstraction. You’ll learn practical implementation details including workload scheduling strategies, automated cloud selection, and dependency management. This approach lets ML engineers use the same interface for both interactive development sessions and large-scale distributed training jobs, enabling them to focus on building great AI products.

Building AI-Ready Infrastructure: Storage, Performance, and Resilience at Scale – Murat Karslioglu (35 minutes)

As AI transforms enterprise operations, infrastructure leaders must evolve their data foundations to meet unprecedented demands for scale, performance, and resilience. MinIO’s head of global field architects, Murat Karslioglu, explores how modern organizations are building AI-ready infrastructure through cloud native storage architectures, automation, and hybrid-cloud agility. Drawing on real-world use cases, he outlines strategies for unifying data across environments, optimizing performance and cost, and embedding resilience into core systems. You’ll leave with a practical framework for enabling scalable, efficient, and future-proof infrastructure, purpose-built for the AI era.

Sam Newman: Closing Remarks (5 minutes)

Sam Newman closes out today’s event.

Your Hosts and Selected Speakers

  • Sam Newman

    Sam Newman is a technologist focusing on the areas of cloud, microservices, and continuous delivery—three topics which seem to overlap frequently. He provides consulting, training, and advisory services to startups and large multinational enterprises alike, drawing on his more than 20 years in IT as a developer, sysadmin, and architect. Sam is the author of the best-selling Building Microservices (now in its second edition) and Monolith To Microservices, both from O’Reilly, and is also an experienced conference speaker.

  • Bryan Oliver

    Bryan is an engineer who designs and builds complex distributed systems. For the last three years, he’s been focused on platforms, GPU infrastructure, and cloud native at Thoughtworks. Currently, he concentrates on large-scale GPU infrastructure and scheduling techniques. Bryan also coauthored Effective Platform Engineering (Manning) and Designing Intelligent Delivery Systems (O’Reilly) and is a Thoughtworks Technology Radar coauthor and committee member. He speaks at conferences globally, occasionally sits on conference committees, and contributes to open source.

  • Saiyam Pathak

    Saiyam Pathak is head of developer relations at vCluster and the founder of Kubesimplify, which focuses on simplifying cloud native and Kubernetes technologies. Previously, Saiyam worked on many facets of Kubernetes, including machine learning platforms, scaling, multicloud, and managed Kubernetes services, at organizations such as Civo, Walmart Labs, Oracle, and HP, and he has implemented Kubernetes solutions in various enterprises. He’s a Kubestronaut and a CNCF TAG Operational Resilience cochair, and he runs a YouTube channel. When he’s not coding, he contributes to the community by writing blogs and organizing local meetups for Kubernetes and CNCF. Saiyam can be reached on Twitter @saiyampathak.

  • Nico Vibert

    Nico Vibert is a technical marketing engineering director at Isovalent, the company behind the open source, cloud native solution Cilium. Previously, he worked in many different roles (operations and support, design and architecture, and technical presales) at companies such as HashiCorp, VMware, and Cisco. In his current role, Nico focuses on creating content to make networking a more approachable field and regularly speaks at events like KubeCon, VMworld, and Cisco Live. He has held over 15 networking certifications, including the Cisco Certified Internetwork Expert (CCIE #22990). Nico is now the lead subject matter expert on the Cilium Certified Associate (CCA) certification and is the coauthor of Cilium: Up and Running (O’Reilly).

  • Romil Bhardwaj

    Romil Bhardwaj is the cocreator of the SkyPilot open source project. He completed his PhD in computer science at UC Berkeley’s RISE Lab, where his research focused on large-scale systems and resource management for machine learning. Romil’s work has led to multiple patents, publications in top conferences, and key awards, including the USENIX ATC 2024 Distinguished Artifact Award and ACM BuildSys 2017 Best Paper. Previously, he was a research fellow at Microsoft Research, where he developed schedulers for distributed machine learning.

  • Murat Karslioglu

    Murat Karslioglu is the head of global field architects at MinIO, where he leads worldwide presales architecture for AI-grade object storage systems. A three-time founder, author of two Kubernetes books, and long-time cloud native and storage technologist, he has designed large-scale data platforms for enterprises across finance, telecommunications, retail, and AI/ML. Murat specializes in high-performance infrastructure, Kubernetes-native architectures, and building scalable AI data foundations.

  • Rob Reid

    Rob Reid is Cockroach Labs’ technical evangelist and a software developer from London, England. He has written backend, frontend, and messaging software for law enforcement, travel, finance, commodities, sports betting, telecoms, retail, and aerospace industries. He’s the author of Practical CockroachDB: Building Fault-Tolerant Distributed SQL Databases (Apress) and Understanding Multi-Region Application Architecture and CockroachDB: The Definitive Guide, second edition (O’Reilly), and he has two CockroachDB tattoos.

Skill covered

Artificial Intelligence (AI)

Sponsored by

  • Cockroach Labs