Chapter 12
Hybrid Parallelism
E. Wes Bethel
Lawrence Berkeley National Laboratory
David Camp
Lawrence Berkeley National Laboratory
Hank Childs
Lawrence Berkeley National Laboratory
Christoph Garth
University of Kaiserslautern
Mark Howison
Brown University
Kenneth I. Joy
University of California, Davis
David Pugmire
Oak Ridge National Laboratory
12.1 Introduction
12.2 Hybrid Parallelism and Volume Rendering
     12.2.1 Background and Previous Work
     12.2.2 Implementation
            12.2.2.1 Shared-Memory Parallel Ray Casting
            12.2.2.2 Parallel Compositing
     12.2.3 Experiment Methodology
     12.2.4 Results
            12.2.4.1 Initialization
            12.2.4.2 Ghost Data/Halo Exchange
            12.2.4.3 Ray Casting
            12.2.4.4 Compositing
            12.2.4.5 Overall Performance
12.3 Hybrid Parallelism and Integral Curve Calculation
     12.3.1 Background and Context
     12.3.2 Design and Implementation
            12.3.2.1 Parallelize Over Seeds
            12.3.2.2 Parallelize Over Blocks
     12.3.3 Experiment Methodology
            12.3.3.1 Factors Influencing Parallelization Strategy
            12.3.3.2 Test Cases
            12.3.3.3 Runtime Environment
            12.3.3.4 Measurements
     12.3.4 Results
            12.3.4.1 Parallelization Over Seeds
            12.3.4.2 Parallelization Over Blocks
12.4 Conclusion and Future Work
References
Hybrid parallelism refers to a blend of distributed- and shared-memory parallel programming techniques within a single application. This chapter presents results from two studies that explore the thesis that hybrid parallelism offers performance advantages for visualization codes on multi-core platforms. The findings show that, compared to a traditional distributed-memory implementation, the hybrid parallel approach uses a smaller memory footprint, performs less interprocess communication, executes faster, and, for some configurations, performs significantly less data I/O.
12.1 Introduction
A distributed-memory parallel computer is made up of multiple nodes, with each node containing one or more cores. Each instance of a parallel program is called a task (or sometimes a Processing Element, or PE). A pure distributed-memory program has one task for each core on each node of the computer. This is not necessary, however. Hybrid parallel programs have fewer tasks per node than cores, and they make use of the remaining cores by using threads: lightweight streams of execution controlled by the task. Threads can share memory amongst themselves and with the main thread associated with the task, allowing for optimizations that are not possible with distributed-memory programming. For example, consider a distributed-memory parallel computer with eight quad-core nodes. A pure distributed-memory program would have thirty-two tasks running, and none of these tasks would make use of shared-memory techniques (although some cores would reside on the same node). A hybrid configuration could have eight tasks, each running with four threads; sixteen tasks, each running with two threads; or even configurations where the number of tasks and threads per node varies.
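To make the task/thread distinction concrete, the sketch below is one possible shape of a hybrid program; it is illustrative only, and the MPI+OpenMP pairing, names, and launch line are assumptions rather than the chapter's code. MPI supplies the tasks, and each task uses OpenMP threads to occupy the remaining cores on its node.

```c
/* Illustrative hybrid-parallel sketch (not the chapter's code):
 * one MPI task per node, with OpenMP threads filling the node's cores.
 * On eight quad-core nodes this might be launched as, e.g.,
 *   OMP_NUM_THREADS=4 mpirun -np 8 ./hybrid_example
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, ntasks;

    /* Request thread support; FUNNELED means only the main thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Each task fans its work out across the node's cores with shared-memory threads. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("task %d of %d: thread %d of %d\n", rank, ntasks, tid, nthreads);
    }

    MPI_Finalize();
    return 0;
}
```

On the eight quad-core nodes of the example above, the same binary could run as eight tasks with four threads each or sixteen tasks with two threads each, simply by changing the number of MPI tasks launched per node and the thread count requested per task.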
This chapter defines and uses the following terminology and notation. Traditional parallelism, or P_T, refers to a design and implementation that uses only MPI for parallelism, regardless of whether the parallel application is run on a distributed- or shared-memory system. Hybrid parallelism, or P_H, refers to a design and implementation that uses both MPI and some other form of shared-memory parallelism, like POSIX threads [4], OpenMP [3], OpenCL [12], CUDA [9], and so forth.

FIGURE 12.1: A 4608² image of a combustion simulation result, rendered by a hybrid parallel MPI+pthreads implementation running on 216,000 cores of the JaguarPF supercomputer. Image source: Howison et al., 2011 [14]. Combustion simulation data courtesy of J. Bell and M. Day (LBNL).
The main focus of this chapter is to present results from two different experiments within the field of high performance visualization that aim to study the extent to which visualization algorithms can benefit from hybrid parallelism when applied to today's largest data sets and on today's largest computational platforms. The studies presented in this chapter use a P_H design, whereby each MPI task will in turn invoke some form of shared-memory parallelism on multi-core CPUs and many-core GPUs.
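As a generic illustration of this P_H pattern (a minimal sketch under assumed names, not the implementation evaluated in this chapter), the code below has each MPI task spawn POSIX threads that work directly on data in the task's address space, so large per-node structures need to exist only once per task rather than once per core.

```c
/* Generic P_H sketch (illustrative; not the chapter's implementation):
 * each MPI task spawns POSIX threads that share the task's memory,
 * avoiding the per-core data replication of a pure P_T design.
 */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

typedef struct { int rank; int tid; } worker_arg;

static void *worker(void *p)
{
    worker_arg *a = (worker_arg *)p;
    /* Threads within a task can read and write shared arrays directly,
     * with no message passing between cores on the same node. */
    printf("task %d: thread %d working on shared data\n", a->rank, a->tid);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t threads[NTHREADS];
    worker_arg args[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NTHREADS; i++) {
        args[i].rank = rank;
        args[i].tid  = i;
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    MPI_Finalize();
    return 0;
}
```

The MPI_THREAD_FUNNELED level suffices here because only the main thread communicates; a design in which worker threads also issue MPI calls would need MPI_THREAD_MULTIPLE.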
One experiment studies a hybrid parallel implementation of ray casting volume rendering at extreme-scale concurrency. The other studies a hybrid parallel implementation of integral curve computation using two different approaches to parallelization. The material in this chapter consolidates information from earlier publications on hybrid parallelism for volume rendering [13, 14] and streamline/integral curve computations [6, 21]. Both of these studies show that a P_H implementation runs faster, uses less memory, and performs less communication and data movement than its P_T counterpart. In some cases, the difference is quite profound, and reveals many insurmount-
