Appendix B. A Collection of Best Practices for Production Services

Written by Ben Treynor Sloss

Edited by Betsy Beyer

Fail Sanely

Sanitize and validate configuration inputs. Respond to implausible inputs by both continuing to operate in the previous state and raising an alert that bad input was received. Bad input often falls into one of these categories:

Incorrect data: Validate both syntax and, if possible, semantics. Watch for empty data and partial or truncated data (e.g., alert if the configuration is N% smaller than the previous version).
Delayed data: Late updates may cause the current data to time out and become invalid. Alert well before the data is expected to expire.
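
The checks above can be sketched in a few lines of Python. This is a minimal illustration and not a method prescribed by the text: it assumes the configuration is a JSON document, and the `backends` key, the 50% shrink threshold, and the 0.5 warning fraction are hypothetical values chosen only for the example.

```python
import json


def validate_config(raw, previous_raw, max_shrink_pct=50.0):
    """Return a list of problems; an empty list means the candidate looks plausible.

    The JSON format, the 'backends' key, and the default threshold are
    assumptions made for this sketch.
    """
    # Empty data: nothing else is worth checking.
    if not raw.strip():
        return ["empty config"]

    problems = []

    # Partial or truncated data: flag a config that is N% smaller than the last one.
    if previous_raw:
        shrink_pct = 100.0 * (len(previous_raw) - len(raw)) / len(previous_raw)
        if shrink_pct >= max_shrink_pct:
            problems.append(f"config shrank by {shrink_pct:.0f}%")

    # Syntax: the candidate must parse.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        problems.append(f"syntax error: {exc.msg}")
        return problems

    # Semantics: a hypothetical rule that at least one backend is listed.
    backends = parsed.get("backends") if isinstance(parsed, dict) else None
    if not isinstance(backends, list) or not backends:
        problems.append("semantic error: 'backends' must be a non-empty list")

    return problems


def expiry_alert_due(age_seconds, ttl_seconds, warn_fraction=0.5):
    """Delayed data: alert once the data has used up a fraction of its lifetime."""
    return age_seconds >= warn_fraction * ttl_seconds
```

A caller would treat any non-empty result from `validate_config` as a reason to alert and to keep the previous configuration. `expiry_alert_due` fires partway through the data's lifetime, so there is time to react before it actually expires.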

Fail in a way that preserves function, possibly at the expense of being overly permissive or overly simplistic. We’ve found that it’s generally safer for systems to continue functioning with their previous configuration and await a human’s approval before using the new, perhaps invalid, data.
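
One way to picture this policy is a small holder object that keeps serving the last accepted configuration while a suspect update waits for review. The sketch below is hypothetical: the `ConfigHolder` class, its method names, and the `validate` and `alert` callables are illustrative assumptions, not an API described in the text.

```python
class ConfigHolder:
    """Keeps the last accepted config active; parks suspect updates for a human.

    `validate(candidate, active)` is assumed to return a list of problems
    (empty when the candidate looks plausible). `alert(problems)` is assumed
    to notify an operator. Both are supplied by the caller.
    """

    def __init__(self, initial, validate, alert):
        self.active = initial   # the config the service is running with
        self.pending = None     # a rejected candidate awaiting human review
        self._validate = validate
        self._alert = alert

    def offer(self, candidate):
        """Try to apply a new config. Returns True if it became active."""
        problems = self._validate(candidate, self.active)
        if problems:
            # Fail sanely: keep the previous config, alert, and wait for approval.
            self._alert(problems)
            self.pending = candidate
            return False
        self.active = candidate
        self.pending = None
        return True

    def approve_pending(self):
        """Called after a human has reviewed the pending config and accepted it."""
        if self.pending is None:
            return False
        self.active = self.pending
        self.pending = None
        return True
```

The key design choice is that a failed validation never changes `active`. The service keeps its previous behavior, and the only path from a suspect configuration to production goes through an explicit human approval.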

Site Reliability Engineering by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
