Chapter 4. Preflight Checklist
The lessons learned here are specific to each of the case studies examined, but may be modified to fit your individual needs. We’ve also listed 10 general guidelines to keep in mind when implementing a large-scale infrastructure change at your organization. Regardless of the size of your organization, applying these key takeaways may help mitigate the challenges of rolling out a large infrastructure change:
Establish a core team (if it doesn’t yet exist) to manage the infrastructure change in the company
Staff the effort with the right people from the beginning so that projects launch and land smoothly. At a minimum, there should be full-time engineers (to build the migration tools and answer questions), technical project managers (to facilitate communication, tracking, and meeting deadlines), and an executive sponsor (to champion the change at the top and ensure it gets prioritized).
Pilot with the more technically savvy, low-risk customers first
These customers are more aware of what features they need and can provide useful feedback to improve the migration before a large rollout. In addition, try to select the customers that are considered to be low risk (i.e., unexpected issues would not stop operations for them).
Understand trade-offs up front as much as possible
While establishing a core team is critical, it’s not always possible to have dedicated people working on the migration project, and the program’s complexity may not have been well understood at the outset. Clarify at the start the opportunity costs of these constraints, such as delivery delays, low-quality project communication, or unmanaged migration risks for critical services. By clarifying these costs up front, the team can identify and proactively accept the risks that the constraints introduce.
Understand your customer requirements
Before the change project starts, gather user requirements to learn what specifically users want in a system and for what purpose. Even if their use cases will not be built in the same way in the new system, this helps you build the right tools for the right audience and supports a smooth migration. As you gather the requirements, you may also come across unique corner cases. By gathering corner-case situations up front, you frontload your risk and can build sufficient slack into the schedule, either to prioritize the requirements or to collaborate with the team to adjust their workflows so that they work with the new system.
Publish your plan of record
A plan of record confirms the project plan and key decisions, as agreed on by the project stakeholders. It includes, at a minimum, a glossary of key vocabulary, project goals, a project timeline, and key milestones with assigned owners. It’s essential to have one source of truth to revisit when plans change. Within Google, we share this plan of record broadly, both inside and outside the project team. Doing so provides transparency into what teams can expect from the project and how decisions were made.
Push the migration out in phases
The migration itself is a disruption to service operations. Even with a plan in place, significant risks still exist. Staging the migration in phases sized to the scale of your organization is an effective way of implementing the change. As issues emerge during earlier phases, you have time to update tools, techniques, and processes before the risks impact more services in later phases. An example of a phased migration approach could be early/alpha testing, voluntary migration, assisted migration, forced migration, and then deprecation of the old service.
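If you track migration status in tooling, the example phases above can be modeled as an ordered sequence. The following is a minimal sketch in Python, not a description of any tool used in the case studies; the names MigrationPhase and next_phase are hypothetical and chosen only for illustration.

```python
from enum import IntEnum
from typing import Optional


class MigrationPhase(IntEnum):
    """Example phases in rollout order (illustrative names, not a standard)."""
    ALPHA_TESTING = 1
    VOLUNTARY = 2
    ASSISTED = 3
    FORCED = 4
    DEPRECATION = 5


def next_phase(current: MigrationPhase) -> Optional[MigrationPhase]:
    """Return the phase that follows `current`, or None after the last phase."""
    if current is MigrationPhase.DEPRECATION:
        return None
    return MigrationPhase(current + 1)
```

Because the phases are ordered, a comparison such as MigrationPhase.ASSISTED < MigrationPhase.FORCED can be used to ask whether a service is still in a phase where migration is optional.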
Automate as much of the manual, repeatable process as possible
Depending on how large the infrastructure change is and how many people it affects, automating relevant processes saves time for engineers, so they can focus on more complex issues, and avoids burdening users with manual and toilsome work. For example, in Moonshot’s case, a migration scheduling tool was built to identify an appropriate time for the migration to take place. This tool took into account cluster maintenance time and launch dates for the service. Think about how much time it would take for you to build an automated tool to perform a process, and how much time you would spend manually performing the process for each service. This helps you determine the return on investment (ROI) of creating a tool versus manually handling the process.
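The build-versus-manual comparison described above is simple arithmetic, and a rough sketch can make it concrete. The function names and the figures in the usage note are hypothetical; substitute your own estimates, and treat the result as a starting point for discussion rather than a precise forecast.

```python
def break_even_services(build_hours: float,
                        manual_hours_per_service: float,
                        automated_hours_per_service: float = 0.0) -> float:
    """Number of services at which building the tool pays for itself."""
    saved_per_service = manual_hours_per_service - automated_hours_per_service
    if saved_per_service <= 0:
        raise ValueError("automation must save time on each service")
    return build_hours / saved_per_service


def net_hours_saved(build_hours: float,
                    manual_hours_per_service: float,
                    services: int,
                    automated_hours_per_service: float = 0.0) -> float:
    """Total hours saved across all services, minus the cost of building the tool."""
    saved_per_service = manual_hours_per_service - automated_hours_per_service
    return services * saved_per_service - build_hours
```

As a made-up example, a tool that takes 120 hours to build and cuts per-service effort from 3 hours to 0.5 hours breaks even at 48 services; across 200 services it would save a net 380 hours.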
Test early and often
Having a testing environment set up for users, so they can check whether their service functions on the new infrastructure, is critical for uncovering and mitigating technical risks. The test environment should simulate, as closely as possible, the behavior that services will see in production once they are migrated. Any deviation from that behavior exposes more risk.
Communicate early and often
For a large-scale infrastructure change, issues may crop up at any time. Those leading the implementation must communicate about the change continually and through the right channels. People are often frustrated when any change occurs, and more so if it was not communicated clearly enough or to the right people. Communicating early and often reduces resistance to the change. Some examples of how you can do this include creating an FAQ, sending announcements to relevant internal engineering newsletters or mailing lists, presenting at company-wide all-hands meetings, offering one-on-one consulting, and creating a project landing page containing relevant information.
Create appropriate escalation and exception procedures
It’s not uncommon for a service to need an exception or extension to a large-scale infrastructure project. This can occur because the new infrastructure does not yet have the features a team needs, because there are conflicting and committed project deadlines, or for other valid reasons. Regardless of the underlying context, providing escalation and exception procedures ensures that teams know the proper channels for communicating and collaborating with the change team. When creating these procedures, gather details such as the name of the service requesting the extension, how much more time it would need, and the justification for the extension.
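One way to make sure every request carries those details is to capture them in a small structured record. The sketch below is an assumption about how such a record might look, not a form used in the case studies; the class name ExceptionRequest, its field names, and the service name in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ExceptionRequest:
    """Details gathered when a service asks for a migration extension."""
    service_name: str     # the service requesting the extension
    extension_days: int   # how much more time the service needs
    justification: str    # why the extension is needed

    def __post_init__(self) -> None:
        # Reject incomplete requests so the change team always has full context.
        if not self.service_name.strip():
            raise ValueError("service_name is required")
        if self.extension_days <= 0:
            raise ValueError("extension_days must be positive")
        if not self.justification.strip():
            raise ValueError("justification is required")

    def new_deadline(self, original_deadline: date) -> date:
        """Deadline that would apply if the extension were granted."""
        return original_deadline + timedelta(days=self.extension_days)
```

For instance, a request created as ExceptionRequest("billing-frontend", 30, "Conflicts with a committed launch date") against an original deadline of March 1, 2024 would yield a new deadline of March 31, 2024.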