High Performance Visualization
These algorithms are the focus of this chapter, in particular 2-3 swap and radix-k for petascale systems.
The motivation for studying image compositing is the same as in most of this book: data sets consisting of trillions of grid points and thousands of timesteps are being produced by machines with hundreds of thousands of cores. Machine architectures are growing in size and in complexity, with multidimensional topologies and specialized hardware such as smart network adaptors, direct-memory access, and graphics accelerators. Against this backdrop, scientists in high energy physics, climate, and other domains are demanding more of visualization: real-time, multivariate, time-varying methods that maximize data locality and minimize data movement.
Image compositing is the final step in sort-last parallel rendering (see 4.2). In sort-last parallel rendering, each processor generates a finished image of its subset of the data, and these images must be combined into one final result.
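The pixel-level operation behind this combining step is typically the "over" operator on premultiplied RGBA samples. As a concrete illustration (not from the text), the following minimal NumPy sketch composites two processes' partial images in one step; the image contents are invented for the example:

```python
import numpy as np

def over(front, back):
    """Composite premultiplied-RGBA `front` over `back` -- the operator
    applied at each combining step of sort-last image compositing."""
    alpha = front[..., 3:4]                 # front image's alpha channel
    return front + (1.0 - alpha) * back

# Two processes' partial renderings of the same 2x2 pixel region,
# ordered front to back along the viewing direction (invented data).
img_near = np.zeros((2, 2, 4))
img_near[0, 0] = [0.5, 0.0, 0.0, 0.5]       # semi-transparent red pixel
img_far = np.zeros((2, 2, 4))
img_far[1, 1] = [0.0, 0.8, 0.0, 0.8]        # mostly opaque green pixel

final = over(img_near, img_far)             # one compositing step
```

Because the over operator is associative (though not commutative), partial composites can be combined in any grouping as long as the front-to-back order is preserved; this is the property that direct-send, binary-swap, and radix-k all exploit.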
Legacy image compositing algorithms were invented for much smaller systems. The history of the direct-send and binary-swap algorithms began in the mid-1990s. Optimization features, such as compression, identification of active pixels, and scheduling, further improved performance. In the early 2000s, production compositing libraries implementing many of these features appeared, some of which are still used today.
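To make binary-swap's communication pattern concrete, here is an in-memory simulation sketch (not production code; it assumes a power-of-two process count, rank order as depth order, and stands in plain arrays for the messages a real implementation would exchange):

```python
import numpy as np

def binary_swap(images):
    """Simulate binary-swap compositing for p = 2^k processes.

    `images` holds one full-size premultiplied-RGBA array per process,
    ordered front to back by rank.  Each round, partners r and r ^ group
    split their current pixel range in half, exchange halves, and
    composite; after log2(p) rounds each process owns a fully composited
    1/p of the image.
    """
    p = len(images)
    n = images[0].shape[0]                  # number of pixels
    lo, hi = [0] * p, [n] * p               # each process's active range
    group = 1
    while group < p:
        nxt = [img.copy() for img in images]
        for r in range(p):
            partner = r ^ group             # exchange partner this round
            mid = (lo[r] + hi[r]) // 2
            # keep the lower or upper half of the current range
            a, b = (lo[r], mid) if r & group == 0 else (mid, hi[r])
            front, back = (r, partner) if r < partner else (partner, r)
            alpha = images[front][a:b, 3:4]
            nxt[r][a:b] = images[front][a:b] + (1 - alpha) * images[back][a:b]
            lo[r], hi[r] = a, b
        images = nxt
        group *= 2
    return images, lo, hi

# Four simulated processes, eight pixels, random premultiplied-RGBA layers.
rng = np.random.default_rng(1)
imgs = []
for _ in range(4):
    alpha = rng.random((8, 1))
    imgs.append(np.hstack([rng.random((8, 3)) * alpha, alpha]))

parts, lo, hi = binary_swap(imgs)           # each rank ends with 2 pixels
```

Each rank finishes with a disjoint slice of the fully composited image; a final gather (not shown) assembles the complete result. Radix-k, discussed later in this chapter, generalizes this pattern by allowing more than two partners per round.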
Advances in the last five years feature more awareness of HPC system architectures. The late 2000s yielded compositing algorithms with higher degrees of concurrency, scalability, and flexibility in the form of the 2-3 swap and radix-k algorithms. The early 2010s continued optimization at very large scale and brought the latest innovations into production implementations. This chapter highlights some results of these recent advances, with both theoretical and actual performance, and it concludes with future directions in highly parallel image compositing.
5.2 Basic Concepts and Early Work in Compositing
The previous chapter classified parallel rendering according to when rasterized images are sorted: sort-first, sort-middle, and sort-last. One way to understand the difference among these methods is to identify what is distributed and what is replicated among the processes. The term process designates a task executed in parallel with other processes, where each process has a separate memory address space and communicates with the others by passing messages.
In sort-first rendering, the image pixels are usually distributed, and the data set is typically replicated. HPC applications generate data sets many times larger than the memory capacity of a single node, so sort-first rendering