I recently sat down with Caskey Dickson, site reliability engineer and software engineer at Microsoft, to discuss the importance of strategic metrics, maximizing the customer experience, and how the role of operations is changing. Here are some highlights from our talk.
1. Why is it so important to think strategically about metrics?
Metrics are the windows into the health and behavior of your service. Without the right metrics you won’t know what is going on or how to fix it. You also need metrics at the appropriate resolution and granularity. If you only know a metric’s fleet-wide average, then you have no idea whether particular server instances are failing your users. Can you tell the difference in error rates and latency broken down not just by server but by software revision? Having the right metrics presented in the right way provides a smoking gun that drives down TTD (time to detect) and TTM (time to mitigate) for the benefit of your customers. Peace of mind is just a bonus.
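To make the granularity point concrete, here is a minimal sketch in Python. It is not taken from the interview: the server names, revision labels, request counts, and helper functions are all hypothetical, chosen only to show how a fleet-wide average can hide one failing slice.

```python
from collections import defaultdict

# Hypothetical request log: (server, software_revision, was_error).
# Every name and number here is invented for illustration.
requests = (
    [("web-1", "r41", False)] * 98 + [("web-1", "r41", True)] * 2
    + [("web-2", "r41", False)] * 99 + [("web-2", "r41", True)] * 1
    + [("web-3", "r42", False)] * 80 + [("web-3", "r42", True)] * 20
)


def error_rate(records):
    """Fraction of records flagged as errors."""
    records = list(records)
    return sum(1 for record in records if record[2]) / len(records)


def error_rate_by(records, key_index):
    """Error rate grouped by one field: 0 = server, 1 = revision."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return {key: error_rate(group) for key, group in groups.items()}


fleet_wide = error_rate(requests)
by_server = error_rate_by(requests, 0)
by_revision = error_rate_by(requests, 1)

print(f"fleet-wide: {fleet_wide:.1%}")
for name, rate in sorted(by_server.items()):
    print(f"server {name}: {rate:.1%}")
for name, rate in sorted(by_revision.items()):
    print(f"revision {name}: {rate:.1%}")
```

In this made-up data the fleet-wide error rate is a little under 8%, which could pass for background noise. The breakdown shows that one server running the newer revision is failing 20% of its requests while the other two sit at 1% to 2%.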
2. You have both an MBA and a degree in systems engineering. What have you learned from stitching together these two bodies of knowledge?
It helps me understand the customer perspective better. Ultimately the software we build and deploy provides some capability to the end user, and they are willing to give us money for that. Without value, we don’t have a business and our jobs don’t exist. I like to ask engineers, “Do you understand the value your product provides to its customers?” Most answer with something like it lets them manage an inventory, or process credit cards. Customers don’t want to manage inventory, or process credit cards. Yes, they need that to get done, but they are doing it as a means to an end. When you understand and support your customers’ desires, you build better systems and get loyal customers.
3. IT operations has increasingly had a voice with respect to customer experience. What role can ops play in product design and product management?
It’s common for ops to be seen as a cost center—perpetually spending money, and occasionally spending even more when things go down. We need to fight that perception by pointing out that every dollar of revenue is a result of ongoing operations. We spend money on ops to earn money. What this means is that the better ops can do its job, the more value we provide to the customer. More importantly, ops is often most directly connected to the customer experience. We sit on the other side of the connection our customers have to the organization. We need to find ways of characterizing the customer experience (e.g., through metrics) and take an active role in sharing that information with the rest of the company.
4. You have a lot of experience with site reliability engineering. What lessons can readers take from the SRE world, when it comes to delivering software?
SRE is just a means to an end. Whether you call yourself an SE, SRE, DevOps or something else, it’s all about delivering a customer experience that matters to them and to the organization. What is changing is the notion that software needs an “operator.” Modern systems engineers or operators are like the elevator operators of the early 20th century. At the time, driving an elevator was necessary because the device itself was insufficiently engineered. It may have been the best that was possible at the time, but that changed. Similarly, in the mid-20th century we had computer operators. Then the computer became personal. Operators have been the glue between insufficient or impossible engineering and users since forever. That is changing and will continue to change as our software becomes more self-reliant but, at the same time, more complex.
5. You're speaking at the Velocity Conference in Amsterdam this November. What presentations are you looking forward to attending while there?
This is genuinely hard to answer. Velocity has always had such a great variety of speakers. I don’t think I’ve regretted sitting in any talk, but if pressed I’d have to say my favorite sessions are always the hallway ones. I look forward to meeting the great people who collectively make the world of online and cloud services happen.