In any sufficiently complex system, failure is impossible to rule out. And failure can have wide-ranging impacts, especially when you are responsible for over 1 terabit per second of Internet traffic. As a Content Delivery Network (CDN), Fastly operates a large internetwork and a global application environment that faces many threats, from both a reliability and a security perspective. As an organization, we deliberately put in place a robust system that allows us to rapidly identify, mitigate, and contain incidents, and that ensures communication flows effectively both within the company and with our customers.
We have many large customers, including Twitter, GitHub, and The Guardian, and our ability to maintain a stable edge environment for them to serve applications from is critical. We continuously work to ensure the entire organization takes this responsibility very seriously.
Where to find inspiration
When you start developing a program, it’s always good to look at how people in other fields have solved similar problems. As engineers, we tend to specialize and think the power to solve a particular problem is in our hands. When we do that, we can forget to ask a few questions:
- Will we be able to ramp up engineers quickly enough?
- Will our partners, such as network providers, be prepared and ready to help us when we need them to?
- How do we know people on the frontline have the time and space to take care of basic needs (such as sleeping, eating, and de-stressing) during a prolonged incident?
It doesn’t take long before you realize established systems must already exist elsewhere. There’s the Incident Command System (ICS), originally developed to address issues in the inter-agency response to California and Arizona wildfires, for instance, and the Gold-Silver-Bronze command structure used in the United Kingdom. For us, Mark Imbriaco’s “Incident Response at Heroku” talk from Surge 2011 was a huge inspiration for our initial framework.
While technology has its own unique characteristics, such as the ability to automate responses, many of these original issues still exist, so we took some of the best practices of existing models while developing our own system.
Understand what you’re defending against
It’s important to understand the types of issues you’re likely to face. Our system, which we refer to as the Incident Response Framework (IRF), came to be a catch-all for any issue that caused customer impact. As we professionalized over time, the system started specializing as well, and there are now specific plans in place covering smaller issues that may not yet cause customer impact, but may have the potential to do so in the future. In addition, a specific Security Incident Response Plan (SIRP) was developed that triggers on any issues that may have security repercussions.
We’ve engaged the IRF for security vulnerabilities that required immediate remediation, for customer-visible outages, and for issues with critical systems that support our network, such as our monitoring and instrumentation tooling.
When the IRF is engaged, we identify the severity of an issue based on customer impact and business risk, as well as the length of time the issue has manifested itself. Depending on the severity, either the team owning the affected service is paged, or an organization-wide Incident Commander is allocated. The Incident Commander is responsible for orchestrating the entire incident response: identifying objectives, managing operations, and assigning resources and responsibilities.
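To make the classification step concrete, here is a minimal sketch in Python of how severity might be derived from impact, risk, and duration, and how that severity might decide who is engaged. The scales, the 30-minute threshold, the level names, and the function names are all hypothetical; they illustrate the idea and are not the actual rules in the IRF.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Issue:
    customer_impact: int  # 0 (none) to 3 (widespread); hypothetical scale
    business_risk: int    # 0 (none) to 3 (severe); hypothetical scale
    minutes_active: int   # how long the issue has been manifesting


def classify(issue: Issue) -> Severity:
    """Combine impact, risk, and duration into one severity level."""
    score = max(issue.customer_impact, issue.business_risk)
    # Assumed rule: an issue with any impact that lingers past
    # 30 minutes is bumped up by one level.
    if issue.minutes_active > 30 and score > 0:
        score += 1
    if score >= 3:
        return Severity.HIGH
    if score == 2:
        return Severity.MEDIUM
    return Severity.LOW


def responder(severity: Severity) -> str:
    """Decide who is engaged: the owning team or an Incident Commander."""
    if severity is Severity.HIGH:
        return "incident_commander"
    return "owning_team"
```

Under these assumptions, `responder(classify(Issue(2, 1, 45)))` escalates to an Incident Commander, because a medium-impact issue that has lasted 45 minutes is bumped to the highest level.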
Identify the issue
Identifying an issue is critical. Within Fastly, we have multiple mechanisms in place to monitor service-related issues, including open source monitoring tools such as Ganglia, log aggregation tools such as Graylog and Elasticsearch, and several custom-built alerting and reporting tools. Ensuring events from each of these services make it to our service owners and Incident Commanders in a timely manner is critical.
For this, every team owning a critical service maintains a pager rotation, and receives reports regarding their own services directly. Engineering teams, however, are empowered to classify and re-classify events as needed to ensure pager rotations do not become too onerous. Most of them develop their own integration that ensures we don’t suffer from alert fatigue on pageable events.
Events that do not lead to significant impact, but over time could indicate wider problems, are reviewed on a regular basis depending on their criticality, rather than leading to immediate action, 24/7. This ensures that over time, we can address many of the root causes that lead to issues before they interrupt an on-call engineer.
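The split between events that page someone and events that wait for a regular review can be sketched as a simple triage step. This is an illustrative Python sketch only: the `Event` fields and the `triage` function are hypothetical names, and the `pageable` flag stands in for whatever classification each team maintains.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    service: str
    message: str
    pageable: bool  # set by the owning team, which may re-classify it later


def triage(events):
    """Split events into immediate pages and a per-service review queue."""
    pages = []
    review_queue = defaultdict(list)
    for event in events:
        if event.pageable:
            pages.append(event)  # goes to the on-call engineer right away
        else:
            review_queue[event.service].append(event)  # looked at in review
    return pages, dict(review_queue)
```

Keeping the classification as data on the event, rather than hard-coding it in the router, is one way to let teams re-classify events without touching the paging path.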
Ramp up the right people
Getting the right people together is always challenging. Engineers are humans too, and it’s important to value their private space and time to make sure they can be effective during the time they’re building our platform.
Each team within Fastly designates a particular individual to be on call during a specific time slot. In addition, most teams that are critical for live services have a secondary engineer on-call, who is also aware of their responsibility to jump in, in case of a major incident.
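As a rough illustration of a primary and secondary rotation, the Python sketch below assigns both roles by week. The weekly slot length, the fixed reference date, and the `on_call` function are assumptions made for the example; the text above does not say how long a slot is or how teams schedule it.

```python
from datetime import date


def on_call(engineers, day: date, epoch: date = date(2024, 1, 1)):
    """Return (primary, secondary) for the week containing `day`.

    Assumes weekly slots counted from `epoch`; the secondary is simply
    the next engineer in the list, who becomes primary the week after.
    """
    week = (day - epoch).days // 7
    primary = engineers[week % len(engineers)]
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary
```

With a three-person team, `on_call(["ana", "bo", "cy"], date(2024, 1, 8))` gives `("bo", "cy")` under these assumptions.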
In addition, the company maintains a few critical people on-call for incidents that grow beyond a single team, or have customer impact. The role of these individuals is different. They know they won’t be troubleshooting the issue directly, but they take on a number of critical roles that help ensure mistakes are minimized, and investigation progresses as quickly as possible. They:
- Coordinate actions across multiple responders.
- Alert and update internal stakeholders, and update customers on our status—or designate a specific person to do so.
- Evaluate the high-level issue and understand its impact.
- Consult with team experts on necessary actions.
- Call off or delay other activities that may impact resolution of the incident.
Incident Commander is not a role someone is ready to tackle when they are new to the company. We select Incident Commanders based on their ability to understand our system architecture from the top down, so they have a mental model of how systems interrelate and impact each other. They are also well-versed in the structure of teams, so they know the right resources to speak with, and can reliably engage them. Finally, they’re excellent communicators, and are able to maintain a cool, calm, and structured approach to incident response.
A specific issue many incidents run into is volunteers. When an incident is taking place, many of our engineers understandably want to help, even if they have no direct responsibility to be involved. When not properly managed, this can have negative effects. The environment can get overly chatty, or it’s not clear who has picked up specific work. We’ve learned that removing people from the incident is often counterproductive, because it demotivates people who want to work. Instead, we try to find opportunities to manage these volunteers, and either have them work on less critical items, or expand our scope of investigation beyond what we’d typically look at. This coordination happens in a different room than the main incident, and is often led by someone other than the main Incident Commander, but results are constantly communicated by a single individual.
Communicate your status
Communication is critical. Both how we communicate and what we communicate in these updates are important to consider.
We use Slack and email as typical communication channels within the company. To communicate status updates to our customers, we use an external status page hosted by Statuspage.io.
Interestingly, some of the services we rely on are often also Fastly customers. This means we can’t necessarily depend on them being online during each type of incident affecting our service. As a result, we’ve gone through various backups, from our own IRC network, through phone chains, to alternative messaging tools, to ensure systems are available. We also worked with some critical vendors to ensure our instance of their service operated on versions of their product that were not hosted behind Fastly, to avoid these circular dependencies.
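The circular-dependency problem can be sketched as a filter over a preference-ordered list of channels. The Python below is a hypothetical illustration: the channel names, the `depends_on_own_platform` flag, and the `usable_channels` function are invented for the example and do not describe which real tools sit behind Fastly.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    depends_on_own_platform: bool  # True if the tool is served via our own edge


def usable_channels(channels, platform_degraded: bool):
    """Return channel names in preference order.

    When our own platform is degraded, skip any channel that depends on
    it, since that channel may be unavailable for the same reason.
    """
    return [
        c.name
        for c in channels
        if not (platform_degraded and c.depends_on_own_platform)
    ]
```

Given a list such as `[Channel("primary_chat", True), Channel("irc", False), Channel("phone_chain", False)]`, a degraded platform leaves only `irc` and `phone_chain`, which is the situation the backups are meant to cover.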
As for the message, over time we had to learn that people at different levels of the company need to know different things about an incident. In security incidents in particular, we assemble a specific group of executives who need to be briefed on very specific qualities of the incident, such as whether customer information was disclosed, or any critical systems were compromised.
Hence we’ve developed our processes to ensure Incident Commanders know what needs to be communicated, and to whom. During large incidents, quite often the Incident Commander delegates ownership of communication to a dedicated resource, to ensure we don’t fail at this. Over- or under-communicating an incident can erode the trust our customers place in us, or lead to bad decisions being made by other stakeholders.
Each incident, no matter how minor it may seem, is logged in our Incident Management ticketing system. Within 24 hours of the incident, the Incident Commander works with her or his team to develop an Incident Report, which is widely shared across the organization. We leverage the knowledge of the wider group involved to ensure it is as accurate as possible.
During this process, we use the time-proven “five whys” technique developed by Sakichi Toyoda of Toyota fame. The idea is simple: For every incident you ask why it took place, and for every answer you ask why again. Once you have asked this question enough times, usually about five, you get to the actual root cause of the issue. This technique is helpful in two ways: intermediate answers give us ideas about what we can do to mitigate a future incident or monitor for it more effectively, and the final answer tells us the underlying problem we likely have to address.
The root cause, as well as each issue that hampered either the identification or response to the incident, receives its own ticket, and incidents that have unresolved tickets are tracked on a weekly basis in an incident review, until all stakeholders are sufficiently assured that the right actions have been taken to prevent recurrence.
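The follow-up tracking described here can be sketched as a small data model. In this hypothetical Python sketch, the `Ticket` and `Incident` types and the `weekly_review_agenda` function are illustrative names, and "every ticket resolved" is a simplification standing in for stakeholders being satisfied that recurrence has been prevented.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    summary: str
    resolved: bool = False


@dataclass
class Incident:
    title: str
    tickets: list = field(default_factory=list)

    def open_tickets(self):
        """Follow-up tickets that still need work."""
        return [t for t in self.tickets if not t.resolved]


def weekly_review_agenda(incidents):
    """Incidents stay on the agenda until every follow-up is resolved."""
    return [i.title for i in incidents if i.open_tickets()]
```

An incident with one resolved root-cause ticket and one open monitoring ticket would stay on the agenda until the second ticket is closed.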
Incidents can also lead to new projects being initiated. For instance, brittle systems are often identified during these processes, and the additional visibility the organization gains may lead to the development of replacements or improvements.