Appendix E. Launch Coordination Checklist

This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:

Architecture

  • Architecture sketch, types of servers, types of requests from clients

  • Programmatic client requests

Machines and datacenters

  • Machines and bandwidth, datacenters, N+2 redundancy, network QoS

  • New domain names, DNS load balancing

Volume estimates, capacity, and performance

  • HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out

  • Load test, end-to-end test, capacity per datacenter at max latency

  • Impact on other services we care most about

  • Storage capacity
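
As a rough illustration of how the spike estimate and the N+2 redundancy item combine into a machine count, the following minimal sketch sizes a server pool. The function name, parameters, and traffic figures are hypothetical and are not part of the original checklist; it assumes N+2 means "enough servers for peak load, plus two spares."

```python
import math


def servers_needed(peak_qps: float, qps_per_server: float, spares: int = 2) -> int:
    """Return N + spares, where N servers can carry the peak load.

    peak_qps should already include the expected launch spike.
    qps_per_server is the per-server capacity measured in a load test
    at the maximum acceptable latency (an assumed input here).
    """
    if peak_qps < 0 or qps_per_server <= 0 or spares < 0:
        raise ValueError("peak_qps and spares must be >= 0, qps_per_server > 0")
    n = math.ceil(peak_qps / qps_per_server)  # servers needed with no failures
    return n + spares


# Hypothetical numbers: 1,000 QPS steady state, a 3x launch spike,
# and 250 QPS per server from load testing.
steady_qps = 1000
spike_factor = 3
total = servers_needed(steady_qps * spike_factor, qps_per_server=250)
```

With these assumed inputs, 12 servers carry the spike and two more cover one planned and one unplanned outage, giving 14 in total.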

System reliability and failover

  • What happens when:

    • Machine dies, rack fails, or cluster goes offline

    • Network fails between two datacenters

  • For each type of server that talks to other servers (its backends):

    • How to detect when backends die, and what to do when they die

    • How to terminate or restart without affecting clients or users

    • Load balancing, rate-limiting, timeout, retry and error handling behavior

  • Data backup/restore, disaster recovery
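
The retry and error-handling item above is often approached with a bounded number of attempts and exponential backoff with jitter. The sketch below shows one such approach under stated assumptions: `call_with_retries`, its parameters, and the choice of retryable exceptions are hypothetical, and the checklist itself does not prescribe any particular policy.

```python
import random
import time


def call_with_retries(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on the given exceptions with jittered backoff.

    The delay cap doubles after each failure, up to max_delay, and the
    actual sleep is a random fraction of that cap ("full jitter") so that
    many clients do not retry in lockstep against a recovering backend.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())
```

Bounding the attempts matters as much as the backoff: unbounded retries from many clients can keep an overloaded backend from recovering. The `sleep` and `rng` parameters exist only so the behavior can be exercised without real delays.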

Monitoring and server management

  • Monitoring internal state, monitoring end-to-end behavior, managing alerts

  • Monitoring the monitoring

  • Financially important alerts and logs

  • Tips for running servers within a cluster environment

  • Don’t crash mail servers by sending yourself email alerts in your own server code
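
One way to avoid flooding a mail server with alerts is to throttle repeated alerts for the same condition. The sketch below is a minimal illustration only: the `AlertThrottle` class, its method names, and the five-minute interval are assumptions, not something the checklist specifies.

```python
class AlertThrottle:
    """Allow at most one alert per key within min_interval_seconds."""

    def __init__(self, min_interval_seconds: float = 300.0):
        self.min_interval_seconds = min_interval_seconds
        self._last_sent = {}  # alert key -> timestamp of the last alert sent

    def should_send(self, key: str, now: float) -> bool:
        """Return True and record the send time if an alert may go out now."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval_seconds:
            return False  # suppress: the same alert was sent too recently
        self._last_sent[key] = now
        return True


throttle = AlertThrottle(min_interval_seconds=300.0)
first = throttle.should_send("disk_full", now=0.0)     # allowed
repeat = throttle.should_send("disk_full", now=100.0)  # suppressed
later = throttle.should_send("disk_full", now=300.0)   # allowed again
```

Timestamps are passed in explicitly so the logic can be tested without a real clock; a server would typically pass `time.monotonic()`.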

Security

  • Security design review, security code audit, spam ...
