Chapter 19. Chaos Engineering on a Database

Why Do We Need Chaos Engineering?

Ever since Netflix open sourced Chaos Monkey in 2011, the tool has grown steadily more popular. If you want to build a distributed system, letting Chaos Monkey go a little crazy on your cluster can help you build a more fault-tolerant, robust, and reliable system.1

TiDB is an open source, distributed, Hybrid Transactional/Analytical Processing (HTAP)2 database developed primarily by PingCAP. It stores what we believe is the most important asset for any database user: the data itself. One of the fundamental and foremost requirements of our system is to be fault-tolerant. Traditionally we run unit tests and integration tests to guarantee that a system is production ready, but these cover just the tip of the iceberg as clusters scale, complexity mounts, and data volumes grow to the petabyte scale. Chaos Engineering is a natural fit for us. In this chapter, we will detail our practices and the specific reasons why a distributed system like TiDB needs Chaos Engineering.

Robustness and Stability

To build users’ trust in a newly released distributed database like TiDB, where data is stored on multiple nodes that communicate with each other, data loss or damage must be prevented at all times. But in the real world, failures can happen at any time, anywhere, in ways we can never anticipate. So how can we survive them? One common way is to make our system fault-tolerant. If one service crashes, another ...
