Chapter 1. Introducing Database Reliability Engineering
Our goal with this book is to provide the guidance and framework for you, the reader, to grow on the path to being a truly excellent database reliability engineer (DBRE). When naming the book, we chose the term reliability engineer rather than administrator.
Ben Treynor, VP of Engineering at Google, says the following about reliability engineering:
fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.
Today’s database professionals must be engineers, not administrators. We build things. We create things. As engineers practicing devops, we are all in this together, and nothing is someone else’s problem. As engineers, we apply repeatable processes, established knowledge, and expert judgment to design, build, and operate production data stores and the data structures within. As database reliability engineers, we must take the operational principles and the depth of database expertise that we possess one step further.
If you look at the non-storage components of today’s infrastructures, you will see systems that are easily built, run, and destroyed via programmatic and often automatic means. The lifetimes of these components can be measured in days, and sometimes even hours or minutes. When one goes away, any number of others can step in to keep the quality of service at expected levels.
Our next goal is that you gain a framework of principles and practices for the design, building, and operating of data stores within the paradigms of reliability engineering and devops cultures. You can take this knowledge and apply it to any database technology or environment that you are asked to work in at any stage in your organization’s growth.
Guiding Principles of the DBRE
As we sat down to write this book, one of the first questions we asked ourselves was what the principles underlying this new iteration of the database profession were. If we were redefining the way people approached data store design and management, we needed to define the foundations for the behaviors we were espousing.
Protect the Data
Protecting the data has traditionally been approached through practices like the following:
A strict separation of duties between the software engineer and the database engineer
Rigorous backup and recovery processes, regularly tested
Well-regulated security procedures, regularly audited
Expensive database software with strong durability guarantees
Underlying expensive storage with redundancy of all components
Extensive controls on changes and administrative tasks
In teams with collaborative cultures, the strict separation of duties can become not only burdensome, but also restrictive of innovation and velocity. In Chapter 8, Release Management, we will discuss ways to create safety nets and reduce the need for separation of duties. Additionally, these environments focus more on testing, automation, and impact mitigation than extensive change controls.
More and more often, architects and engineers are choosing open source datastores that cannot guarantee durability the way that something like Oracle might have in the past. Sometimes, that relaxed durability gives needed performance benefits to a team looking to scale quickly. Choosing the right datastore, and understanding the impacts of those choices, is something we look at in Chapter 11. Recognizing that there are multiple tools based on the data you are managing and choosing effectively is rapidly becoming the norm.
Underlying storage has also undergone significant change. In a world where systems are often virtualized, network and ephemeral storage is finding a place in database design. We will discuss this further in Chapter 5.
The new approach to data protection looks more like this:
Responsibility for the data is shared by cross-functional teams.
Standardized and automated backup and recovery processes blessed by DBRE.
Standardized security policies and procedures blessed by DBRE and Security teams.
All policies enforced via automated provisioning and deployment.
Data requirements dictate the datastore, with evaluation of durability needs becoming part of the decision making process.
Reliance on automated processes, redundancy, and well-practiced procedures rather than expensive, complicated hardware.
Changes incorporated into deployment and infrastructure automation, with focus on testing, fallback, and impact mitigation.
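The last item, enforcing every policy through automated provisioning and deployment, can be sketched as a deploy-time gate. This is a minimal illustration with made-up policy names and structure, not the API of any real provisioning tool:

```python
# Hypothetical provisioning-time policy check: refuse to deploy a datastore
# whose definition violates the standards blessed by DBRE and Security.

REQUIRED_POLICIES = {
    "backups_enabled": True,       # standardized backup process
    "backup_tested": True,         # recovery is regularly exercised
    "encryption_at_rest": True,    # security standard
    "replicas": lambda n: n >= 2,  # redundancy over expensive hardware
}

def policy_violations(datastore_config: dict) -> list[str]:
    """Return the names of violated policies (empty list means deployable)."""
    violations = []
    for policy, expected in REQUIRED_POLICIES.items():
        value = datastore_config.get(policy)
        ok = expected(value) if callable(expected) else value == expected
        if not ok:
            violations.append(policy)
    return violations

config = {"backups_enabled": True, "backup_tested": False,
          "encryption_at_rest": True, "replicas": 1}
print(policy_violations(config))  # → ['backup_tested', 'replicas']
```

In practice a check like this would run inside the provisioning pipeline and fail the deploy, rather than printing to a console.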
Self-Service for Scale
A talented DBRE is a rarer commodity than a site reliability engineer (SRE) by far. Most companies cannot afford to hire and retain more than one or two. So, we must create the most value possible, which comes from creating self-service platforms for teams to use. By setting standards and providing tools, teams are able to deploy new services and make appropriate changes at the required pace without serializing on an overworked database engineer. Examples of these kinds of self-service methods include:
Ensuring that the appropriate metrics are being collected from data stores by providing the correct plug-ins.
Building backup and recovery utilities that can be deployed for new data stores.
Defining reference architectures and configurations for data stores that are approved for operations, and can be deployed by teams.
Working with Security to define standards for data store deployments.
Building safe deployment methods and test scripts for database changesets to be applied.
In other words, the effective DBRE functions by empowering others and guiding them, not functioning as a gatekeeper.
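As a sketch of the reference-architecture idea above, self-service can be as simple as a catalog of DBRE-approved configurations that teams deploy from on their own. The architecture names and fields here are invented for illustration:

```python
# Hypothetical catalog of DBRE-approved reference configurations that teams
# can deploy from without waiting on a database engineer.

APPROVED_ARCHITECTURES = {
    "small-oltp": {"engine": "postgres", "replicas": 2, "backup_schedule": "hourly"},
    "cache": {"engine": "redis", "replicas": 3, "backup_schedule": "none"},
}

def render_deployment(name: str, service: str) -> dict:
    """Produce a deployable spec for a service from an approved architecture."""
    if name not in APPROVED_ARCHITECTURES:
        raise ValueError(f"{name!r} is not an approved reference architecture")
    spec = dict(APPROVED_ARCHITECTURES[name])  # copy so the catalog stays pristine
    spec["service"] = service
    return spec

print(render_deployment("small-oltp", "billing"))
```

Teams get speed and autonomy; the DBRE keeps the guard rails by curating the catalog rather than approving every deploy.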
Elimination of Toil
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Effective use of automation and standardization is necessary to ensure that DBREs are not overburdened by toil. Throughout this book, we will bring up examples of DBRE-specific toil and approaches to mitigating it. That being said, the word “toil” is still vague, with lots of preconceptions that vary from person to person. When we discuss toil in this book, we are specifically talking about manual work that is repetitive, non-creative, and non-challenging.
Databases Are Not Special Snowflakes
Our systems are no more or less important than any other components serving the needs of the business. We must strive for standardization, automation, and resilience. Critical to this is the idea that the components of database clusters are not sacred. We should be able to lose any component and efficiently replace it without worry. Fragile data stores in glass rooms are a thing of the past.
The metaphor of pets versus cattle is often used to show the difference between a special snowflake and a commodity service component. Original attribution goes to Bill Baker, Microsoft Distinguished Engineer. A pet server is one that you feed, care for, and nurture back to health when it is sick. It also has a name. At Travelocity in 2000, our servers were Simpsons characters, and our two SGI servers running Oracle were named Patty and Selma. I spent so many hours with those gals on late nights. They were high maintenance!
Cattle servers have numbers, not names. You don’t spend time customizing servers, much less logging on to individual hosts. When they show signs of sickness, you cull them from the herd. You should, of course, keep those culled cattle around for forensics, if you are seeing unusual amounts of sickness. But, we’ll refrain from mangling this metaphor any further.
Data stores are some of the last holdouts of “pethood.” After all, they hold “The Data,” and simply cannot be treated as replaceable cattle with short lifespans and complete standardization. What about the special replication rules for our reporting replica? What about the different config for the primary’s redundant standby?
Eliminate the Barriers Between Software and Operations
Your infrastructure, configurations, data models, and scripts are all part of software. Study and participate in the software development lifecycle as any engineer would. Code, test, integrate, build, test, and deploy. Did we mention test?
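To make “configurations are software” concrete: a configuration file can have unit-style tests that run in CI like any other code. This sketch uses an invented key = value format purely for illustration:

```python
# Hypothetical example of treating a database configuration as tested
# software: a plain unit-style check that runs in CI alongside app tests.

def parse_config(text: str) -> dict:
    """Parse a minimal key = value config format (illustrative only)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

def test_replica_count():
    config = parse_config("replicas = 2\n# standby required\nengine = postgres")
    assert int(config["replicas"]) >= 2, "a standby is required before deploy"

test_replica_count()
print("config checks passed")
```

The point is not the parser; it is that a failing assertion blocks the deploy the same way a failing application test would.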
This might be the hardest paradigm shift for someone coming from an operations and scripting background. There can be an organizational impedance mismatch between the way software engineers navigate an organization and the way operations teams have traditionally supported the systems and services built to meet that organization’s needs. Software engineering organizations have very defined approaches to developing, testing, and deploying features and applications.
In a traditional environment, the process of designing, building, testing, and pushing infrastructure and related services to production was split among software engineering (SWE), systems engineering (SE), and database administration (DBA) teams. The paradigm shifts discussed previously push for the removal of this impedance mismatch, which means DBREs and systems engineers find themselves needing to use similar methodologies to do their jobs.
Software Engineers Must Learn Operations!
Too often, operations folks are told to “learn to code or go home.” While I do agree with this, the reverse must be true as well. Software engineers who are not pushed and led to learn operations and infrastructure principles and practices will create fragile, non-performant, and potentially insecure code. The impedance mismatch only goes away if all teams are brought to the same table!
DBREs might also find themselves embedded directly in a software engineering team, working in the same code base, examining how code is interacting with the data stores, and modifying code for performance, functionality, and reliability. The removal of this organizational impedance creates an improvement in reliability, performance, and velocity an order of magnitude greater than traditional models, and DBREs must adapt to these new processes, cultures, and tooling.
Operations Core Overview
One of the core competencies of the DBRE is operations. These are the building blocks for designing, testing, building, and operating any system with scale and reliability requirements that are not trivial. This means that if you want to be a database engineer, you need to know these things.
Operations at a macro level is not a role. Operations is the combined sum of all of the skills, knowledge, and values that your company has built up around the practice of shipping and maintaining quality systems and software. It’s your implicit values as well as your explicit values, habits, tribal knowledge, and reward systems. Everybody, from tech support to product people to the CEO participates in your operational outcomes.
Too often, this is not done well. So many companies have an abysmal ops culture that burns out whoever gets close to it. This can give the discipline a bad reputation, which many folks think of when they think of operations jobs, whether in systems, database, or network. Despite this, your ops culture is an emergent property of how your org executes on its technical mission. So if you go and tell us that your company doesn’t do any ops, we just won’t buy it.
Perhaps you are a software engineer or a proponent of infrastructure and platforms as a service. Perhaps you are dubious that operations is a necessity for the intrepid database engineer. The idea that serverless computing models will liberate software engineers from needing to think or care about operational impact is flat out wrong. It is actually the exact opposite. It’s a brave new world where you have no embedded operations teams—where the people doing operations engineering for you are Google SREs and AWS systems engineers and PagerDuty and DataDog and so on. This is a world where application engineers need to be much better at operations, architecture, and performance than they currently are.
Hierarchy of Needs
Some of you will be coming at this book with experience in enterprises and some in startups. As we approach and consider systems, it is worth thinking about what you would do on day one of taking on the responsibility of operating a database system. Do you have backups? Do they work? Are you sure? Is there a replica you can fail over to? Do you know how to do that? Is it on the same power strip, router, hardware, or availability zone as the primary? Will you know if the backups start failing somehow? How?
In other words, we need to talk about a hierarchy of database needs.
For humans, Maslow’s hierarchy of needs is a pyramid of desire that must be satisfied for us to flourish: physiological survival, safety, love and belonging, esteem, and self-actualization. At the base of the pyramid are the most fundamental needs, like survival. Each level roughly proceeds to the next—survival before safety, safety before love and belonging, and so forth. Once the first four levels are satisfied, we reach self-actualization, which is where we can safely explore and play and create and reach the fullest expression of our unique potential. So that’s what it means for humans. Let’s apply this as a metaphor for what databases need.
Survival and Safety
Your database’s most essential needs are backups, replication, and failover. Do you have a database? Is it alive? Can you ping it? Is your application responding? Does it get backed up? Will restores work? How will you know if this stops being true?
Is your data safe? Are there multiple live copies of your data? Do you know how to do a failover? Are your copies distributed across multiple physical availability zones or multiple power strips and racks? Are your backups consistent? Can you restore to a point in time? Will you know if your data gets corrupted? How? Plan on exploring this much more in the backup and recovery section.
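One way to make the “will restores work?” question answerable by a machine rather than a human is a freshness check over backup and restore-drill timestamps. A minimal sketch, with thresholds chosen arbitrarily:

```python
# Hypothetical backup sanity check: a backup only "counts" if it is recent
# AND a test restore of it has recently succeeded.

import datetime

def backup_is_trustworthy(last_backup: datetime.datetime,
                          last_verified_restore: datetime.datetime,
                          now: datetime.datetime,
                          max_backup_age_hours: int = 24,
                          max_restore_age_days: int = 7) -> bool:
    """True only if the newest backup is fresh and a restore was recently tested."""
    backup_fresh = (now - last_backup) <= datetime.timedelta(hours=max_backup_age_hours)
    restore_tested = (now - last_verified_restore) <= datetime.timedelta(days=max_restore_age_days)
    return backup_fresh and restore_tested

now = datetime.datetime(2024, 1, 10, 12, 0)
print(backup_is_trustworthy(
    last_backup=datetime.datetime(2024, 1, 10, 3, 0),      # nine hours old: fresh
    last_verified_restore=datetime.datetime(2023, 12, 1),  # restore drill too old
    now=now))  # → False
```

A check like this, run on a schedule and wired to alerting, answers “will you know if the backups start failing?” without anyone remembering to look.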
This is also the time when you start preparing for scale. Scaling prematurely is a fool’s errand, but you should consider sharding, growth, and scale now as you determine IDs for key data objects, storage systems, and architecture.
Love and Belonging
Love and belonging is about making your data a first-class citizen of your software engineering processes. It’s about breaking down silos between your databases and the rest of your systems. This is both technical and cultural, which is why you could also just call this the “devops needs.” At a high level, it means that managing your databases should look and feel (as much as possible) like managing the rest of your systems. It also means that you culturally encourage fluidity and cross-functionality. The love and belonging phase is where you slowly stop logging in and performing cowboy commands as root.
It is here where you begin to use the same code review and deployment practices. Database infrastructure and provisioning should be part of the same process as all other architectural components. Working with data should feel consistent to all other parts of the application, which should encourage anyone to feel they can engage with and support the database environment.
Resist the urge to instill fear in your developers. It’s quite easy to do and quite tempting because it feels better to feel like you have control. It’s not—and you don’t. It’s much better for everyone if you invest that energy into building guard rails so that it’s harder for anyone to accidentally destroy things. Educate and empower everyone to own their own changes. Don’t even talk about preventing failure, as that is impossible. In other words, create resilient systems and encourage everyone to work with the datastore as much as possible.
Esteem
Esteem is the highest of the deficiency needs in the pyramid. For humans, this means respect and mastery. For databases, this means things like observability, debuggability, introspection, and instrumentation. It’s about being able to understand your storage systems themselves, but also being able to correlate events across the stack. Again, there are two aspects to this stage: one of them is about how your production services evolve through this phase, and the other is about your humans.
Your services should tell you if they’re up or down or experiencing error rates. You should never have to look at a graph to find this out. As your services mature, the pace of change slows down a bit as your trajectory becomes more predictable. You’re running in production, so you’re learning more every day about your storage system’s weaknesses, behaviors, and failure conditions. This can be compared to the teenage years for data infrastructure. What you need more than anything is visibility into what is going on. The more complex your product is, the more moving pieces there are and the more engineering cycles you need to allocate to developing the tools you need to figure out what’s happening.
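The idea that a service should declare itself unwell, rather than waiting for a human to read a graph, can be sketched as a self-reported health check over a sliding window of request outcomes. The class name, window, and threshold here are illustrative, not any real monitoring library’s API:

```python
# Hypothetical self-reporting health check: the service flips to unhealthy
# when its recent error rate crosses a threshold, with no graph-reading needed.

from collections import deque

class HealthReporter:
    def __init__(self, window: int = 100, max_error_rate: float = 0.05):
        self.results = deque(maxlen=window)  # most recent request outcomes only
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def healthy(self) -> bool:
        if not self.results:
            return True  # no traffic yet, nothing to alarm on
        error_rate = self.results.count(False) / len(self.results)
        return error_rate <= self.max_error_rate

reporter = HealthReporter(window=10, max_error_rate=0.2)
for ok in [True] * 7 + [False] * 3:   # 30% recent errors
    reporter.record(ok)
print(reporter.healthy())  # → False
```

Exposed on a health endpoint, a boolean like this is what load balancers and alerting should consume, instead of a human eyeballing dashboards.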
You also need knobs. You need the ability to selectively degrade quality of service instead of going completely down, e.g.:
Flags where you can set the site into read-only mode
Disabling certain features
Queueing writes to be applied later
The ability to blacklist bad actors or certain endpoints
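The knobs above can be sketched as explicit flags on the storage layer. This toy example shows read-only rejection and write queueing; all names are ours for illustration, not any real datastore’s API:

```python
# Hypothetical "knobs" for graceful degradation: a read-only flag and a
# write queue, so the service can degrade instead of going fully down.

class DegradableStore:
    def __init__(self):
        self.read_only = False      # knob: site-wide read-only mode
        self.queue_writes = False   # knob: defer writes instead of rejecting
        self.data = {}
        self.pending = []           # writes queued for later application

    def write(self, key, value):
        if self.queue_writes:
            self.pending.append((key, value))   # applied once healthy again
            return "queued"
        if self.read_only:
            return "rejected"                   # degrade rather than crash
        self.data[key] = value
        return "written"

    def drain(self):
        """Apply queued writes after the incident is over."""
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

store = DegradableStore()
store.queue_writes = True
print(store.write("user:1", "alice"))  # → queued
store.queue_writes = False
store.drain()
print(store.data)  # → {'user:1': 'alice'}
```

The design choice is that degradation is an explicit, operator-controlled state of the system, not an accident of overload.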
Your humans have similar but not completely overlapping needs. A common pattern here is that teams will overreact once they get into production. They don’t have enough visibility, so they compensate by monitoring everything and paging themselves too often. It is easy to go from zero graphs to literally hundreds of thousands of graphs—99% of which are completely meaningless. This is not better. It can actually be worse. If it generates so much noise that your humans can’t find the signal and are reduced to tailing log files and guessing again, it’s as bad or worse than not having the graphs.
This is where you can start to burn out your humans by interrupting them, waking them up, and training them not to care about or act on the alerts they do receive. In the early stages, if you’re expecting everyone to be on call, you need to document things. When you’re bootstrapping and sharing on-call duties, you’re pushing people outside of their comfort zones, so give them a little help. Write minimally effective documentation and procedures.
Self-Actualization
Just like every person’s best possible self is unique, every organization’s self-actualized storage layer is unique. The platonic ideal of a storage system for Facebook doesn’t look like the perfect system for Pinterest or GitHub, let alone a tiny startup. But just like there are patterns for healthy, self-actualized humans (they don’t throw tantrums in the grocery store, they eat well and exercise), there are patterns for what we can think of as healthy, self-actualized storage systems.
In this context, self-actualization means that your data infrastructure helps you get where you’re trying to go and that your database workflows are not obstacles to progress. Rather, they empower your developers to get work done and help save them from making unnecessary mistakes. Common operational pains and boring failures should remediate themselves and keep the system in a healthy state without needing humans to help. It means you have a scaling story that works for your needs, whether that means 10x’ing every few months or just being rock solid, stable, and dumb for three years before you need to worry about capacity. Frankly, you have a mature data infrastructure when you can spend most of your time thinking about other things. Fun things. Like building new products or anticipating future problems instead of reacting to current ones.
It’s okay to float back and forth between levels over time. The levels are mostly there as a framework to help you think about relative priorities: making sure you have working backups is far more important than writing a script to dynamically re-shard and add more capacity. And if you’re still at the point where you have only one copy of your data online, or you don’t know how to fail over when your primary dies, you should probably stop whatever you’re doing and figure that out first.
The DBRE role is a paradigm shift from an existing, well-known role. More than anything, the framework gives us a new way to approach the functions of managing datastores in a continually changing world. In the upcoming section, we will begin exploring these functions in detail, prioritizing operational functions due to their importance in day-to-day database engineering. With that being said, let’s move bravely forward, intrepid engineer!