Software engineering for operations

Google's SRE team on the software engineering skills they want from their site reliability engineers.

By David Helstroom, Trisha Weir, Evan Leonard and Kurt Delimon
April 26, 2016
Architecture (source: Pixabay)

Software Engineering in SRE

Ask someone to name a Google software engineering effort and
they’ll likely list a consumer-facing product like Gmail or Maps;
some might even mention underlying infrastructure such as Bigtable or
Colossus. But in truth, there is a massive amount of behind-the-scenes
software engineering that consumers never see. A number of those
products are developed within SRE.

Google’s production environment is—by some measures—one of the most complex machines humanity has ever built. SREs have firsthand experience with the intricacies of
production, making them uniquely well suited to develop the
appropriate tools to solve internal problems and support use cases
related to keeping production running. The majority of these tools are
related to the overall directive of maintaining uptime and keeping
latency low, but take many forms: examples include binary rollout
mechanisms, monitoring, or a development environment built on dynamic
server composition. Overall, these SRE-developed tools are
full-fledged software engineering projects, distinct from one-off
solutions and quick hacks, and the SREs who develop them have adopted
a product-based mindset that takes both internal customers and a
roadmap for future plans into account.

Why Is Software Engineering Within SRE Important?

In many ways, the vast scale of Google production has necessitated
internal software development, because few third-party tools are designed
at sufficient scale for Google’s needs. The company’s history of
successful software projects has led us to appreciate the benefits of
developing directly within SRE.

SREs are in a unique position to effectively develop internal software
for a number of reasons:

  • The breadth and depth of Google-specific production knowledge
    within the SRE organization allows its engineers to design and create
    software with the appropriate considerations for dimensions such as
    scalability, graceful degradation during failure, and the ability to
    easily interface with other infrastructure or tools.

  • Because SREs are embedded in the subject matter, they easily
    understand the needs and requirements of the tool being developed.

  • A direct relationship with the intended user—fellow SREs—results
    in frank and high-signal user feedback. Releasing a tool to an
    internal audience with high familiarity with the problem space means
    that a development team can launch and iterate more quickly. Internal
    users are typically more understanding when it comes to minimal UI and
    other alpha product issues.

From a purely pragmatic standpoint, Google clearly benefits from
having engineers with SRE experience developing software. By
deliberate design, the growth rate of SRE-supported services exceeds
the growth rate of the SRE organization; one of SRE’s guiding
principles is that “team size should not scale directly with service
growth.” Achieving linear team growth in the face of exponential
service growth requires perpetual automation work and efforts to
streamline tools, processes, and other aspects of a service that
introduce inefficiency into the day-to-day operation of production.
It makes sense for the people with direct experience running
production systems to develop the tools that will ultimately
contribute to uptime and latency goals.
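
To make the scaling argument concrete, here is a minimal sketch with
purely hypothetical numbers. The starting sizes, the growth rates, and
the function name services_per_engineer are illustrative assumptions,
not figures from Google. It compares a service count that multiplies by
a fixed factor each year against a team that adds a fixed number of
engineers each year, and reports how many services each engineer would
have to cover.

```python
def services_per_engineer(years, services=100, service_growth=1.5,
                          engineers=10, hires_per_year=2):
    """Return the services-per-engineer ratio for year 0 through `years`.

    Every default is a made-up illustrative value: the service count
    multiplies by `service_growth` each year (exponential growth), while
    the team gains `hires_per_year` engineers each year (linear growth).
    """
    ratios = []
    for _ in range(years + 1):
        ratios.append(services / engineers)
        services *= service_growth    # exponential service growth
        engineers += hires_per_year   # linear team growth
    return ratios


if __name__ == "__main__":
    for year, ratio in enumerate(services_per_engineer(5)):
        print(f"year {year}: {ratio:.1f} services per engineer")
```

Under these assumed rates the load per engineer rises every year, which
is the gap that automation and tooling work has to close.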

On the flip side, individual SREs, as well as the broader SRE
organization, also benefit from SRE-driven software development.

Fully fledged software development projects within SRE provide career
development opportunities for SREs, as well as an outlet for engineers
who don’t want their coding skills to get rusty. Long-term project
work provides much-needed balance to interrupts and on-call work, and
can provide job satisfaction for engineers who want their careers to
maintain a balance between software engineering and systems
engineering.

Beyond the design of automation tools and other efforts to reduce the
workload for engineers in SRE, software development projects can
further benefit the SRE organization by attracting and helping to
retain engineers with a broad variety of skills. Team diversity is
doubly desirable for SRE, where a variety of backgrounds
and problem-solving approaches can help prevent blind spots. To this
end, Google always strives to staff its SRE teams with a mix of
engineers with traditional software development experience and
engineers with systems engineering experience.
