AI and LLM Deployment with Kubernetes

Published by O'Reilly Media, Inc.

Content level: Beginner

Get started building an infrastructure for hosting GenAI on Kubernetes

What you’ll learn and how you can apply it

  • Learn to set up and manage a Kubernetes infrastructure tailored for hosting AI applications
  • Discover the best practices for configuring Kubernetes resources to enhance the performance of AI applications
  • Gain practical skills by running an Ollama-based AI application on Kubernetes, using real-world scenarios

Course description

Companies today are increasingly reliant on LLMs for internal inference and chat applications. To effectively deploy these AI-driven solutions, a flexible and scalable infrastructure is essential, and Kubernetes is the premier choice for this. This two-day course is designed to equip you with the skills you need to host and manage AI applications using Kubernetes.

Through hands-on practice and real-world scenarios, Kubernetes expert and trainer Sander van Vugt takes you through the requirements for setting up a Kubernetes infrastructure to host AI, and the necessary Kubernetes components. You’ll also practice running a simple AI application based on Ollama in Kubernetes, using all the Kubernetes resources typically seen in an AI infrastructure and positioning them to effectively support internal GenAI initiatives powered by LLMs.

This live event is for you because...

  • You’re looking for a scalable platform for running AI applications.
  • You want to integrate GPUs in Kubernetes.
  • You want to learn how to run an AI application on top of Kubernetes.
  • You’re a DevOps, data, or AI engineer, or an infrastructure architect who’s working with LLMs.

Prerequisites

  • A basic understanding of Kubernetes
  • To follow along with the GPU-based parts of this course, you need at least one virtual or physical server with an NVIDIA GPU. GPU instances offered by common cloud platforms are supported. In addition, you need two servers without a GPU.
  • To follow along with building the Kubernetes cluster, you need at least three virtual machines running Ubuntu Server LTS.
  • Each virtual machine should meet the following minimum requirements: 2 CPUs, 4 GB RAM, 40 GB disk space. Optional: one or more virtual machines with a GPU.

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Day 1

Requirements for setting up Kubernetes to service LLMs (60 minutes)

  • Presentation: Base Kubernetes node requirements; offering GPU access; considerations for on-premises or public cloud-based Kubernetes; exploring setup options for servicing LLMs on premises or in public cloud
  • Q&A
  • Break

Building a Kubernetes cluster to service LLMs (70 minutes)

  • Presentation: Installing GPU drivers; configuring the container runtime for GPU usage; building the base Kubernetes cluster; installing the GPU operator
  • Hands-on exercise: Set up the base Kubernetes cluster
  • Q&A
  • Break

Understanding resources for servicing LLMs (60 minutes)

  • Presentation: Analyzing an application that services LLMs; running applications in Pods, Deployments, and DaemonSets; providing access to applications with Services, Ingress, and Gateway API; providing access to storage with PV, PVC, and StorageClass
  • Q&A
  • Break

Configuring the GPU operator (50 minutes)

  • Presentation: Exploring GPU operator options and functionality; monitoring GPU operator components; configuring the GPU operator for GPU time-slicing
  • Hands-on exercise: Explore GPU time-slicing
  • Q&A

Day 2

Presenting scalable storage to the application (70 minutes)

  • Presentation: Using Pod volumes or persistent volumes; setting up the application for using Pod volumes; setting up the application for using persistent volumes
  • Hands-on exercise: Provide persistent storage to the LLM-based application
  • Q&A
  • Break

Running inference workloads on Kubernetes (70 minutes)

  • Presentation: Understanding what is needed; fetching the LLM using Jobs or init containers; running a Deployment with the vLLM inference server; using resource requests and nodeSelectors; troubleshooting
  • Hands-on exercise: Run vLLM on Kubernetes
  • Q&A
  • Break

Providing access to the application (70 minutes)

  • Presentation: Setting up the Service resource; configuring Gateway API
  • Hands-on exercise: Configure all that is needed to initiate a chat session with the application
  • Q&A
  • Break

AI workload scalability (10 minutes)

  • Presentation: Why the Horizontal Pod Autoscaler doesn’t work; understanding inference job scalability requirements; planning for efficient resource usage

Kubeflow quick overview (10 minutes)

  • Presentation: Understanding Kubeflow use cases; Kubeflow component overview

Wrap-up and Q&A (10 minutes)

Your Instructor

  • Sander van Vugt


Skills covered

  • Kubernetes
  • .NET
  • Knative
  • Serverless Architecture