CHAPTER 9
Fault Tolerance

Fault tolerance is a necessity in large-scale data processing systems. In this chapter, we will look at why we need to handle faults, the techniques available to support failure handling, and the trade-offs between them. We will start with a general discussion of faults in distributed systems and then narrow our focus to batch and streaming data processing systems.

Dependable Systems and Failures

A computer consists of many parts: a motherboard, central processing unit (CPU), random access memory (RAM), hard drives, graphics processing units (GPUs), power units, and network interface cards. Some of these components are built into a single circuit board; for example, Ethernet network interface cards are included in most motherboards. Each component has a probability of failure that increases with its use. In large distributed systems, computer networks connect thousands of computers with additional hardware thrown into the mix, including network switches, cooling systems, and power systems. Owing to the sheer number of hardware components involved in large-scale distributed computations, the probability that some part fails during a computation only increases.
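To make this concrete, here is a minimal sketch (not from the book) of why failures become near-certain at scale. Assuming each machine fails independently during a job with some small probability p, the chance that at least one of n machines fails is 1 - (1 - p)^n; the per-machine probability used below is purely hypothetical.

```python
def prob_at_least_one_failure(n: int, p: float) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p) ** n

# Hypothetical numbers: a 0.1% chance that a single machine fails during a job.
for n in (1, 100, 1000, 10000):
    print(f"{n:>6} machines -> P(at least one failure) = "
          f"{prob_at_least_one_failure(n, 0.001):.3f}")
```

With these assumed numbers, a single machine almost never fails, but with 1,000 machines the chance of at least one failure is already about 63%, and with 10,000 machines a failure is practically guaranteed.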

Software failures are equally evident in large distributed applications. These include operating system issues; insufficient resources such as memory, hard disk space, or network bandwidth; and bugs in the software itself. Applications encounter software failures more often than hardware failures in the initial ...
