Chapter 26. Performance Tuning

The performance characteristics of a distributed streaming application are often dictated by complex relationships among internal and external factors involved in its operation.

External factors are bound to the environment in which the application executes, such as the hosts that constitute the cluster and the network that connects them. Each host provides resources such as CPU, memory, and storage, each with its own performance characteristics. For example, we might have magnetic disks, which are typically slow but offer low-cost storage, or solid-state drive (SSD) arrays, which provide very fast access at a higher cost per storage unit. Or we might be using cloud storage, whose performance is bound to the capacity of the network and the available internet connection. Likewise, the data producers are often outside the control of the streaming application.

Internal factors include the complexity of the algorithms implemented, the resources assigned to the application, and the particular configuration that dictates how the application behaves.

In this chapter, we first work to gain a deeper understanding of the performance factors in Spark Streaming. Then, we survey several strategies that you can apply to tune the performance of an existing job.

The Performance Balance of Spark Streaming

Performance tuning in Spark Streaming can sometimes be complex, but it always begins with the simple equilibrium between the batch interval and the batch processing ...
