Live Online training

Spotlight on Data: Machine Learning in Production at Google Scale with Todd Underwood

Making machine learning reliable

Topic: Data
Todd Underwood

Watch the video recording of this event.

You can propel your business forward with AI-centric approaches to solving customer needs, but to be successful, you need to deploy your machine learning models at scale. Yet engineers face unique challenges when using machine learning-based products in production environments, such as specialized resource management and measuring user happiness.

Join us for this edition of Spotlight on Data as Todd Underwood, Google’s director and lead for machine learning in site reliability engineering (SRE), explains how to sustainably run machine learning systems at scale. You’ll learn why machine learning is essential to Google’s core functions, providing key advantages across most of Google’s products, including Search, Ads, Payments, Billing, Shopping, and more, and how SRE supports these production machine learning systems. You’ll also discover how the company is working to democratize access to AI by making machine learning technologies available to customers via its Cloud AI products.

O’Reilly Spotlight explores emerging business and technology topics and ideas through a series of one-hour interactive events. You’ll engage in a live conversation with experts, sharing your questions and ideas while hearing their unique perspectives, insights, fears, and predictions for the future.

In every edition of Spotlight on Data, you’ll learn about, discuss, and debate the tools, techniques, questions, and quandaries in the world of data. You’ll discover how successful companies leverage data effectively and how you can follow their lead to transform your organization and prepare for the Next Economy.

What you'll learn and how you can apply it

  • Key considerations for deploying your machine learning models and services at scale
  • How SRE can best support production machine learning systems

This training course is for you because...

  • You're an engineer or other technical contributor to machine learning projects, and you need to know how to scale and support your services in production environments.


  • Come with your questions for Todd Underwood
  • Have a pen and paper handy to capture notes, insights, and inspiration

About your instructor

  • Todd Underwood is a site reliability engineering director at Google in Pittsburgh, leading several teams of engineers working on machine learning, Ads, Payments, Billing, Shopping, and data center and cluster infrastructure. Todd’s expertise includes distributed systems, especially for machine learning and AI pipelines, and he has a background in systems engineering and networking. He’s presented work on the future of systems and software reliability engineering at LISA13, LISA16, and SREcon EU15. He’s coauthor of a chapter in the O'Reilly Site Reliability Engineering book and has published a paper in USENIX’s ;login: magazine. Todd has presented work related to internet routing dynamics and relationships at NANOG, RIPE, and various internet interconnection meetings and was previously chair of the NANOG Program Committee and the RIPE Programme Committee.


The timeframes are only estimates and may vary according to how the class is progressing.

Monday, August 5, 2019, at 9:00am PT / 12:00pm ET

  • Introduction and presentation (15 minutes)
  • Interactive discussion and Q&A (45 minutes)