Learning Real-time Processing with Spark Streaming by Sumit Gupta

Performance tuning

Spark provides various configuration parameters which, if used efficiently, can significantly improve the overall performance of your Spark Streaming job. Let's look at a few of the features that can help us tune our Spark jobs.
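
Many of these parameters are set on the SparkConf object before the StreamingContext is created. The following is a minimal sketch; the application name and the parameter values are illustrative assumptions, not tuning recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values only -- tune them against your own workload.
val conf = new SparkConf()
  .setAppName("TunedStreamingJob")
  .set("spark.default.parallelism", "8")          // default partition count for shuffle operations
  .set("spark.streaming.blockInterval", "200ms")  // how often received data is chunked into blocks
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// 2-second batch interval; streaming data is grouped into batches of this duration
val ssc = new StreamingContext(conf, Seconds(2))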

Partitioning and parallelism

Spark Streaming jobs collect and buffer data at regular intervals (batch intervals). Each batch of data is represented by an RDD, and its processing is divided into various stages of execution that form the execution pipeline, known as a Directed Acyclic Graph (DAG).
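
As a minimal sketch of this batch-to-RDD relationship (the socket source, host, and port are assumptions for illustration): every batch interval the DStream produces one RDD, and the transformations applied to it define the stages of the DAG.

// Assumes the StreamingContext `ssc` from the previous sketch.
val lines = ssc.socketTextStream("localhost", 9999)

val wordCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // the shuffle here marks a stage boundary in the DAG

wordCounts.foreachRDD { rdd =>
  // each call receives the RDD representing one batch interval of data
  println("Partitions in this batch's RDD: " + rdd.partitions.length)
}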

The dataset involved in each stage of the execution pipeline is further split into data blocks of equal size, and these blocks are nothing more than the partitions of the RDD.
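
To control how many partitions (and therefore how many parallel tasks) each batch is processed with, the data can be explicitly repartitioned, or a partition count can be passed to shuffle operations. A minimal sketch, reusing the lines DStream from the previous example; the value 16 is purely illustrative:

// Redistribute each batch's RDD into 16 partitions before further processing.
val repartitioned = lines.repartition(16)

// Shuffle operations also accept an explicit partition count.
val counts = repartitioned.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 16)   // 16 reduce-side partitions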

Lastly, for ...
