5 ways to make your operations more resilient in 2016

As software becomes increasingly complex, a focus on resilience is critical to meeting customer expectations and business goals.

By Courtney Nash
January 3, 2016
Bindweed plant breaking through asphalt (source: Mark Dixon via Flickr)

This piece was first published in the O’Reilly Web Ops and Performance newsletter.

In their foundational book Resilience Engineering in Practice, Erik Hollnagel and his co-editors lay out the four key aspects of resilience as “the ability a) to respond to what happens, b) to monitor critical developments, c) to anticipate future threats and opportunities, and d) to learn from past experience—successes as well as failures.” As the software we build every day becomes increasingly complex—and notably reliant on other software often outside our own direct control—a focus on resilience is critical to meeting customer expectations and your business or organization’s goals. Here are a few resources to help you make this year a more resilient one.


1. Put some sec in your DevOps

What the tech world needs now is probably not another portmanteau, but the fact remains that tacking security onto the end of your hard-won continuous delivery pipeline (if you address it at all) means you’re likely releasing untested vulnerabilities. And as the last couple of years have shown, dealing with security bugs once they are out in the wild can be incredibly painful and costly. In his Velocity talk late last year, Pete Cheslock drives the point home: along with building security in earlier to avoid releasing vulnerabilities in the first place, “The most important part of what Heartbleed and Bashbug… taught us is that those people that can move quickly to push changes and updates out are the ones that are going to be staying safe.”
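Building security in earlier can be as simple as making a vulnerability scan a first-class stage of the pipeline rather than a pre-release afterthought. As a rough sketch in a generic CI configuration (the stage names and the `dep-audit`/`static-scan` commands are illustrative placeholders, not any specific vendor’s syntax):

```yaml
# Illustrative CI pipeline: security checks run on every commit.
# "dep-audit" and "static-scan" are hypothetical placeholder commands.
stages:
  - build
  - test
  - security
  - deploy

security:
  stage: security
  script:
    - dep-audit --fail-on high   # flag known-vulnerable dependencies
    - static-scan src/           # basic static analysis for common flaws
  # A failure here blocks the deploy stage, so a known vulnerability
  # can't ride along with an otherwise green build.
```

The point of the sketch is the placement: because the scan gates every deploy, fixing a disclosed vulnerability becomes one more fast push through the same pipeline, which is exactly the quick-to-ship posture Cheslock describes.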

2. Get to know your (3rd-party) neighbors

Even if you do tighten up your own security practices, odds are you’ve got 3rd-party scripts and services scattered throughout your code that can undermine your site or app’s security and performance. In this lightning demo from Velocity Amsterdam, Guy Podjarny and Assaf Hefetz show how Snyk helps you easily find and fix vulnerabilities introduced by 3rd-party components. And over on the SOASTA blog, Tammy Everts offers 10 solid tips on measuring and managing the performance of 3rd-party scripts.

3. Monitor what matters

We’ve seen an explosion in monitoring tools and companies over the past few years, no doubt related to the increasing dominance of virtualization, containers, and the cloud. But with the convenience and flexibility of all these developments comes complexity, and figuring out what to monitor (and how) can be vexing. James Turnbull’s 2015 monitoring survey revealed that many people are using multiple open-source and proprietary tools in combination. We’ve got a great video on Sensu available and an early release of our Graphite book, but even if you have your tools and implementation sorted out, here’s a question for you: what should wake someone up at 2 a.m.?

Turnbull’s survey revealed that a majority of respondents have some alerts go unanswered. He wasn’t able to fully tease out why (e.g., alert fatigue), but the point remains that alerting is (still) very tricky. It may help to focus on business metrics first: Kickstarter, where Turnbull is CTO, tracks things like the number of pledges as a way to get a sense of its systems’ overall health, while Etsy’s NOC team monitors orders, not servers or network traffic. That’s not to say these companies don’t collect data on pretty much anything that moves, just that they focus on business metrics to decide whether things are going sideways. So if the New Year has you reconsidering your monitoring, one idea is to start from the business—and what really impacts your customers—and work your way back from there.
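To make the idea concrete, here is a minimal sketch of alerting on a business metric rather than host-level stats. The metric (orders per minute), the thresholds, and the function names are all hypothetical, chosen only to illustrate the shape of the check:

```typescript
// Sketch: page a human on a business metric (orders per minute),
// not on CPU or network counters. All names and thresholds here
// are illustrative assumptions, not any particular tool's API.

interface Sample {
  timestamp: number;       // epoch seconds
  ordersPerMinute: number; // the business metric we actually care about
}

// Only page when the metric stays below a floor for several
// consecutive samples -- a simple guard against alert fatigue
// from single-sample blips.
function shouldPage(
  samples: Sample[],
  floor: number,
  consecutive: number
): boolean {
  let run = 0;
  for (const s of samples) {
    run = s.ordersPerMinute < floor ? run + 1 : 0;
    if (run >= consecutive) return true;
  }
  return false;
}
```

The design choice worth noting is that the threshold is expressed in the business’s own units, so when the pager does go off, the on-call engineer already knows customers are affected.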

4. Expect failure

By now, you may have heard of Netflix’s failure-inducing tools like Chaos Monkey, some of which can cause fairly widespread and heady cloud infrastructure failures. But failure can be less obvious, yet still very painful for your users (and/or your business). In his early 2015 post, Ilya Grigorik describes what he calls “Resilient Networking”: the practice of building mobile sites and apps that can handle degradation in cellular network service. I checked back in with him to see his take a year later, and he pointed to significant progress you can start taking advantage of now with service workers.

Service workers, which are essentially scripts that run in the background, help web developers provide commonly expected native-like experiences with offline functionality previously unavailable to web apps and sites, while also offering significant performance wins. They open up things like Application Shell Architectures and allow developers to more easily prevent 3rd-party single points of failure (SPOF). For some inspiration on how you might start using service workers, check out this case study from Flipkart.
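The SPOF-prevention idea boils down to a pattern a service worker’s fetch handler can implement: race the network against a timeout, and fall back to a cached copy when a slow or dead resource would otherwise hang the page. Here is a rough sketch of that pattern as a standalone helper; the function name and wiring are illustrative assumptions, not a standard API:

```typescript
// Sketch of the "network, falling back to cache on timeout" pattern
// a service worker fetch handler can use to keep a slow 3rd-party
// resource from becoming a single point of failure. The names here
// are illustrative, not a specific library's API.

function raceWithFallback<T>(
  network: Promise<T>,      // e.g. the in-flight fetch(request)
  cached: () => Promise<T>, // e.g. a cache lookup for the same request
  timeoutMs: number
): Promise<T> {
  const timeout = new Promise<T>((_, reject) =>
    setTimeout(() => reject(new Error('network timeout')), timeoutMs)
  );
  // If the network answers in time, use it; if it is slow or fails,
  // serve the cached copy instead of blocking the page.
  return Promise.race([network, timeout]).catch(() => cached());
}
```

Inside an actual service worker, this might be wired up along the lines of `event.respondWith(raceWithFallback(fetch(event.request), () => caches.match(event.request), 3000))`, though real caching strategies (and how they handle cache misses) vary considerably.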

5. Be blameless

“The root cause of this kerfuffle is operator error. You need to deal with it. We can’t afford to have this happen again.”

If you work in operations for any kind of high-profile or high-volume web business, you’ve probably heard words like these in the past. The details may vary, but whenever something goes wrong—a server goes down, the network fails, orders stop coming in—we tend to rush to find a root cause, a reason why. And so often, we blame each other (or even sometimes, ourselves). But this rarely prevents similar problems from happening again at some point in the future. On top of that, blame can be the fuel for ongoing toxic cultural behaviors.

In his newly released book, Beyond Blame, Dave Zwieback takes readers on a fictional roller-coaster ride through an outage at a major trading firm. Blending complexity science, resilience engineering, human factors, cognitive science, and organizational psychology, he urges us to understand how complex systems fail and, when they do, to move beyond searching for who or what to blame, focusing instead on addressing areas of fragility within systems and organizations. As a first step in this direction, ask yourself what happens when something goes wrong where you work—what could you change to shift from blame to learning? Try a blameless postmortem and see where it takes you.
