Chapter 1. Introduction

How do you, as an engineering practitioner, know whether the change project you are managing qualifies as “infrastructure change”?

  • Are the terms upgrade, migration, or decommission part of the change definition?

  • Does your change affect multiple teams, organizations, products, and services within the company?

  • Does your change impact engineering capabilities to maintain current plans, configurations, processes, or to apply software or policy changes?

If your answer to the above questions is yes, you are rolling out a large-scale infrastructure change.

We define infrastructure change management (ICM) as the execution of a planned, large-scale infrastructure change in order to increase project velocity, reduce cost, and lessen the overall pain inflicted on affected teams and customers.

A clichéd, though apt, idiom for this kind of large-scale infrastructure change is “building the jet while flying it.” Keeping the jet in flight and on course while building and rebuilding it requires an enormous number of people to work as a team. If an engine dies, the crew needs to assess the situation, determine a corrective course of action, and ensure the safety of passengers onboard while communicating the issue in the right way, at the right frequency, to avoid widespread panic.

Large-scale infrastructure change works the same way, requiring coordination and communication with many teams, good processes and documentation, risk identification and management, monitoring, and tracking of the change progress. You can’t ignore the low-probability but highly catastrophic events that can crop up mid-flight. Exercises like the Wheel-of-Misfortune1 (disaster role playing) and DiRT2 (an annual event that pushes production systems to their limits and inflicts actual outages) are good ways to uncover these. The SRE Workbook also describes a number of organizational change management frameworks that may be useful to consider alongside infrastructure change.3

Infrastructure Change Management

These changes require strong processes and project management to ensure decisions are well-informed and communicated. The ICM program at Google, consisting of a dedicated team of technical program managers (TPMs), does just that: centrally driving migrations, deprecations, and other large-scale changes to infrastructure. Programs that ICM supports go through the following life cycle:

Concept Phase

Someone has an idea for a large-scale infrastructure change that could benefit from ICM support.

Backlog Phase

ICM performs a feasibility assessment of the concept proposal, compares its effort costs against the expected outcome, and ranks it in priority against other initiatives.

Planning Phase

People build an actionable project plan, publish target schedules, create objectives and key results for impacted teams, define key milestones and deliverables, and identify stakeholders and staffing. The goal of this phase is to take a concept proposal from the backlog and turn it into a workable execution plan.

Execution Phase

In this phase, the project is under active execution. Impacted teams have product area (PA)–wide objectives and key results centered on compliance with the program’s goals.

ICM also provides dashboards to track infrastructure change progress across all active programs, as well as a tool called Assign-o-Matic that quickly maps production groups to their best contacts. Many groups are neither associated with a product nor point to a human. Assign-o-Matic’s heuristics for identifying the best contact solve the difficult and time-consuming problem of finding the right owner. Driving over a dozen active infrastructure change programs, ICM manages the complex network of dependencies that exist between them, so that the jet stays aloft with minimal-to-no impact on passengers.

One such program that ICM supported was the two-year MapReduce deprecation. MapReduce,4 a flagship framework for large-scale data processing at Google, had been in maintenance mode since 2013. However, MapReduce usage continued to increase and by August 2017, users processed nearly 30 EiB of input and produced over 7 EiB of data. The goal of this infrastructure change program was to migrate all users off the MapReduce backend onto Flume, a higher-level application programming interface (API) built on top of MapReduce, which simplified expression of large-scale data computations.

Flume was designed to make building data-processing pipelines easier and more efficient. Rather than programming and tying together a series of independent MapReduce stages, we wrote one program with Flume and let it handle the execution details. Because Flume abstracted away the low-level infrastructure, we did not need to work with all the underlying primitives: the panoply of data storage formats, parallel execution primitives, and job controller systems available at Google. Flume took care of all that, providing numerous benefits including reduced runtimes, less maintenance, and the ability for Google to focus on supporting a single platform for all users.
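To make the contrast between hand-wired stages and a single pipeline program concrete, here is a minimal Python sketch. It is a toy illustration only: the `run_stage` helper and the `Pipeline` class are hypothetical names invented for this example, not Flume's or MapReduce's actual API, and everything runs in memory on one machine. The same two-stage job (count words, then count how many words share each frequency) is written once by wiring stages together manually and once as a single chained program that handles stage execution itself.

```python
from collections import defaultdict

# Hypothetical toy code for illustration; not Flume's or MapReduce's real API.


def run_stage(records, mapper, reducer):
    """Run one MapReduce-like stage: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


class Pipeline:
    """Records a chain of stages and executes them only when run() is called."""

    def __init__(self, source):
        self._source = source
        self._stages = []

    def map_reduce(self, mapper, reducer):
        self._stages.append((mapper, reducer))
        return self  # returning self allows the stages to be chained

    def run(self):
        records = list(self._source)
        for mapper, reducer in self._stages:
            records = run_stage(records, mapper, reducer)
        return records


def split_words(line):
    return [(word, 1) for word in line.split()]


def by_count(pair):
    _word, count = pair
    return [(count, 1)]


def sum_values(key, values):
    return (key, sum(values))


lines = ["a b a", "b c"]

# Style 1: the author wires each stage to the next by hand.
word_counts = run_stage(lines, split_words, sum_values)
manual = run_stage(word_counts, by_count, sum_values)

# Style 2: one program describes the whole pipeline; the library runs it.
histogram = (
    Pipeline(lines)
    .map_reduce(split_words, sum_values)
    .map_reduce(by_count, sum_values)
    .run()
)
```

Both styles produce the same result here. The point of the second style is that the intermediate hand-off between stages becomes the library's concern rather than the author's, which is the kind of execution detail the text describes Flume as taking over.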

In August 2018, MapReduce was deprecated and replaced by Flume. During 2018, 50% of 30-day active build targets migrated off MapReduce and, by September 2019, over 45% of the remaining active targets were off MapReduce. As of 2019, Flume was rolled out to over 99% of C++ and Java pipelines, and the Flume support rotation was staffed with 12 engineers. Migrating to a new API and execution environment came at a cost to users. ICM helped minimize this cost and drive the migration alongside the many others in flight.

In this report, we provide two case studies on large infrastructure changes at Google: a two-year effort to migrate all of the company’s systems from Google File System (GFS) to Colossus and a six-year effort to remove local disk storage for all jobs and move toward Diskless compute nodes. For each of these case studies, we provide an overview, the project’s impact, the tools and processes used to manage the change, as well as individual lessons learned after each completed change. We conclude with a collection of key takeaways to consider when implementing a large-scale infrastructure change at your own organization. We hope that by sharing what worked and didn’t work for us in these changes, other organizations may learn from our best practices and prepare for any anticipated risks that might occur along the way.

1 For more info, see Chapter 28 of Site Reliability Engineering.

2 For more info, see Chapter 33 of Site Reliability Engineering.

3 Consider using any of the frameworks referenced in the SRE workbook chapter Organizational Change Management in SRE.

4 More information is available online.
