CHAPTER 7
Parallel Tasks

A task represents an abstract unit of work in a program. It can be fine-grained, such as reading a file from disk or sending a message to a receiver, or it can be coarse-grained, such as predicting the next word in a sentence or simulating the weather over several days. It is useful to view a program as a collection of tasks; beyond deciding how granular a task should be, there are other factors we must consider when orchestrating tasks in data-intensive applications.
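To make the idea concrete, here is a minimal Java sketch that treats each unit of work as a task submitted to a thread pool. The readFile and simulateWeather methods are hypothetical placeholders standing in for a fine-grained and a coarse-grained task, respectively.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskGranularity {
    public static void main(String[] args) throws Exception {
        // A pool of worker threads; each submitted task is an independent unit of work.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Fine-grained task: read a single file (placeholder implementation below).
        Future<String> fine = pool.submit(() -> readFile("data/part-0.csv"));

        // Coarse-grained task: simulate several days of weather (placeholder).
        Future<Double> coarse = pool.submit(() -> simulateWeather(3));

        System.out.println("bytes read: " + fine.get().length());
        System.out.println("simulation output: " + coarse.get());
        pool.shutdown();
    }

    // Hypothetical stand-ins so the sketch compiles on its own.
    static String readFile(String path) { return "contents of " + path; }
    static double simulateWeather(int days) { return days * 1.0; }
}

Whether each task should be as small as a single file read or as large as a whole simulation run is exactly the granularity decision discussed above; the executor does not care, but performance does.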

CPUs

The CPU is synonymous with data analytics. Even when we use accelerators such as GPUs, we still need the CPU to launch and coordinate the programs that run on them. Getting the best out of a CPU for data analytics depends on many factors, including memory access patterns (data structures), cache utilization, thread programming, and access to shared resources.

Cache

The cache is one of the most important hardware structures for the performance of large-scale data analytics. Data transfer between main memory and the CPU is governed by throughput and latency. Throughput is the amount of data that can be transferred from memory to the CPU per second. Latency is the time between the CPU issuing a request for data in memory and receiving that data.
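As an illustration of why cache utilization matters, the following Java sketch (the matrix size and the simple timing approach are purely illustrative) sums the same matrix twice: once in row-major order, which streams through memory and reuses each fetched cache line, and once in column-major order, which keeps jumping between rows and stalls on memory latency far more often.

public class CacheTraversal {
    static final int N = 4096;

    public static void main(String[] args) {
        double[][] matrix = new double[N][N];

        // Row-major: consecutive accesses touch adjacent memory, so most of
        // them are served from a cache line that is already loaded.
        long start = System.nanoTime();
        double rowSum = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                rowSum += matrix[i][j];
            }
        }
        System.out.printf("row-major:    sum=%.1f, %d ms%n",
                rowSum, (System.nanoTime() - start) / 1_000_000);

        // Column-major: each access lands in a different row array, defeating
        // the cache and paying the main-memory latency cost repeatedly.
        start = System.nanoTime();
        double colSum = 0;
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < N; i++) {
                colSum += matrix[i][j];
            }
        }
        System.out.printf("column-major: sum=%.1f, %d ms%n",
                colSum, (System.nanoTime() - start) / 1_000_000);
    }
}

On typical hardware the column-major loop is noticeably slower, even though both loops perform exactly the same arithmetic; the only difference is how well each one uses the cache.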

Memory bandwidth has increased steadily over the years, but latency has improved only marginally. A modern CPU can execute many operations in the time it takes to fetch data from main memory; at a clock rate of roughly 3 GHz, for example, a core completes a cycle about every 0.33 ns, so a main-memory access on the order of 100 ns costs a few hundred cycles. ...
