28
A Cache-Aware Hybrid Sorter
Manny Ko
DreamWorks Animation
Sorting is one of the most basic building blocks of many algorithms. In graphics,
a sort is commonly used for depth-sorting for transparency [Patney et al. 2010]
or
to get better Z-cull performance. It is a key part of collision detection [Lin 2000].
Dynamic state sorting is critical for minimizing state changes in a scene graph
renderer. Recently, Garanzha and Loop [2010] demonstrated that it is highly
profitable to buffer and sort rays within a ray tracer to extract better coherency,
which is key to high GPU performance. Ray sorting is one example of a well-known practice in scientific computing, where parallel sorts are used to handle irregular communication patterns and workloads.
Well, can’t we just use the standard template library (STL) sort? We can, but we can also do better. How about up to six times better? Quicksort is probably the best comparison-based sort and, on average, works well. However, its worst-case behavior can be O(n²), and its memory access pattern is not very cache-friendly. Radix sort is the only practical O(n) sort out there (see the appendix for a quick overview of radix sort). Its memory access pattern during the first pass, where we are building counts, is very cache-friendly. However, the final output phase uses random scatter writes. Is there a way for us to use radix sort but minimize its weaknesses?
Modern parallel external sort (e.g., AlphaSort [Nyberg et al. 1995]) almost
always uses a two-pass approach of in-memory sort followed by a merge. Each
item only has to be read from disk twice. More importantly, the merge phase is
very I/O friendly since the access is purely sequential. Substitute “disk” with
“main memory” and “memory” with “cache,” and the same considerations apply: we want to minimize reads from main memory, and we also love the sequential
access pattern of the merge phase.
Hence, if we partition the input into substreams that fit into the cache and
sort each of them with radix sort, then the scatter writes now hit the cache, and
our main concern is addressed. One can substitute shared memory for cache in
the above statement and apply it to a GPU-based sort. Besides the scattering con-
cern, substreams also enable us to keep the output of each pass in cache so that it
is ready for the next pass without hitting main memory excessively.
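As an illustration of the substream idea, the split can be computed from the cache size. The following is a hypothetical sketch (the function name and the half-cache budget are assumptions for illustration, not the book's tuned values):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into substreams whose working set fits in the cache; each
// substream is then sorted with radix sort, and the sorted substreams are
// merged. Returns [begin, end) index pairs, one per substream.
std::vector<std::pair<size_t, size_t>>
SplitStream(size_t n, size_t cacheBytes, size_t elemSize)
{
    // Budget half the cache per substream so both the radix-sort input and
    // its scatter buffer stay resident (an assumption, not a measured figure).
    size_t maxElems = (cacheBytes / 2) / elemSize;
    if (maxElems == 0)
        maxElems = 1;

    std::vector<std::pair<size_t, size_t>> spans;
    for (size_t begin = 0; begin < n; begin += maxElems)
        spans.push_back({begin, std::min(n, begin + maxElems)});
    return spans;
}
```

Each span can then be sorted independently, which also makes the scheme easy to parallelize across cores.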
Our variant of radix sort first makes one pass through the input and accumulates four sets of counters, one for each radix digit. We are using radix-256,
which means each digit is one byte. Next, we compute the prefix sums of the
counters, giving us the final positions for each item. Finally, we make several
passes through the input, one for each digit, and scatter the items into the correct
order. The output of the scattering pass becomes the input to the next pass.
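The passes described above can be sketched as follows. This is a minimal illustration for 32-bit keys, not the tuned code from the accompanying sample:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Radix-256 LSD sort for 32-bit keys: one counting pass builds all four
// digit histograms, then four scatter passes place the items.
void RadixSort256(std::vector<uint32_t>& a)
{
    const size_t n = a.size();
    std::vector<uint32_t> tmp(n);
    uint32_t count[4][256] = {};

    // Single pass over the input accumulates all four histograms at once.
    for (size_t i = 0; i < n; ++i) {
        uint32_t v = a[i];
        ++count[0][v & 0xFF];
        ++count[1][(v >> 8) & 0xFF];
        ++count[2][(v >> 16) & 0xFF];
        ++count[3][(v >> 24) & 0xFF];
    }

    uint32_t* src = a.data();
    uint32_t* dst = tmp.data();
    for (int d = 0; d < 4; ++d) {
        // Exclusive prefix sum turns counts into final positions.
        uint32_t offset[256];
        uint32_t sum = 0;
        for (int b = 0; b < 256; ++b) {
            offset[b] = sum;
            sum += count[d][b];
        }

        // Scatter pass; the output becomes the input of the next pass.
        for (size_t i = 0; i < n; ++i)
            dst[offset[(src[i] >> (8 * d)) & 0xFF]++] = src[i];
        std::swap(src, dst);
    }
    // After an even number of passes, the sorted data is back in 'a'.
}
```

The scatter writes here are the random accesses the chapter is concerned with; restricting each sort to a cache-sized substream keeps them cheap.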
Radix sort was originally developed for integers, since it relies on extracting parts of the key using bit operations. Applying it directly to floating-point values works fine for positive numbers, but for negative numbers, the results are sorted in the wrong order. One common approach is to treat the most significant radix digit as a special case [Terdiman 2000]. However, that involves a test in the inner loop that we would like to avoid. A nice bit hack by Herf [2001] solves this nasty problem for radix sort.
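The flip can be sketched as below. After applying it, IEEE-754 floats (negatives included) sort correctly as unsigned integers with no branch in the inner loop; the helper names here are illustrative:

```cpp
#include <cstdint>
#include <cstring>

// Negatives: flip every bit, which reverses their order; positives: flip
// only the sign bit, which moves them above all the negatives.
static inline uint32_t FloatFlip(uint32_t f)
{
    uint32_t mask = (uint32_t)(-(int32_t)(f >> 31)) | 0x80000000u;
    return f ^ mask;
}

// Inverse transform, applied when reading the sorted output back.
static inline uint32_t IFloatFlip(uint32_t f)
{
    uint32_t mask = ((f >> 31) - 1u) | 0x80000000u;
    return f ^ mask;
}

// Type-pun a float to its bit pattern without aliasing violations.
static inline uint32_t FloatBits(float x)
{
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}
```

The mask is built branch-free: `f >> 31` is 1 for negatives, so negating it yields all-ones, and OR-ing in the sign bit covers the positive case.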
For efficient merging, we use an oldie but goodie called the loser tree. It is a lot more efficient than the common heap-based merger.
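A minimal loser-tree merger is sketched below, following the classic tournament-tree description; this is an illustration, not the book's code. Each internal node remembers the loser of its match, so emitting an element replays only the path from the winning run's leaf to the root, with one comparison per level:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Merge k sorted runs of 32-bit keys with a loser tree. Exhausted runs
// (and padding runs) present UINT64_MAX as a sentinel key.
std::vector<uint32_t> LoserTreeMerge(const std::vector<std::vector<uint32_t>>& runs)
{
    const int k = (int)runs.size();
    if (k == 0) return {};

    int kp = 1;                          // pad the leaf count to a power of two
    while (kp < k) kp <<= 1;

    std::vector<size_t> pos(k, 0);       // read cursor per run
    auto head = [&](int r) -> uint64_t { // current key; +inf when exhausted
        return (r < k && pos[r] < runs[r].size())
                   ? (uint64_t)runs[r][pos[r]] : UINT64_MAX;
    };

    // tree[1..kp-1] store the losing run of each match; conceptual leaves
    // kp..2kp-1 map to runs 0..kp-1.
    std::vector<int> tree(kp, -1);
    std::function<int(int)> build = [&](int node) -> int {
        if (node >= kp) return node - kp;        // leaf: run index
        int a = build(2 * node), b = build(2 * node + 1);
        int winner = (head(a) <= head(b)) ? a : b;
        tree[node] = (winner == a) ? b : a;      // remember the loser
        return winner;
    };
    int winner = build(1);

    std::vector<uint32_t> out;
    while (head(winner) != UINT64_MAX) {
        out.push_back((uint32_t)head(winner));
        ++pos[winner];
        // Replay the matches from the winner's leaf up to the root.
        for (int node = (kp + winner) / 2; node >= 1; node /= 2)
            if (head(tree[node]) < head(winner))
                std::swap(tree[node], winner);
    }
    return out;
}
```

A binary heap does up to two comparisons per level (parent against both children) plus the cost of the pop-then-push pair; the loser tree's single comparison per level is what makes it the faster merger.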
At the end we get a sorter that is two to six times faster than STL and has
stable performance across a wide range of datasets and platforms.
28.1 Stream Splitting
This chapter discusses a class, built on Wright’s [2006] sample, that obtains the cache and translation lookaside buffer descriptions of various Intel and AMD CPUs. The basic code is very simple since it uses the cpuid instruction, as shown in Listing 28.1. Some of the detailed information is not available directly from the returned results, so in the sample code on the website, many of the details based on Intel’s application notes are entered manually. They are not needed for the purposes of the sample code, which only requires the sizes of the caches.
U32 CPU::CacheSize(U32 cachelevel)
{
    U32 ax, bx, cx, dx;
    cx = cachelevel;
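Only the setup of the listing survives here; the rest is in the book's sample code. As a companion sketch, the cache size can be decoded from the registers returned by CPUID leaf 4 (deterministic cache parameters, with the cache level index placed in ECX), following Intel's documented field layout. The function name is illustrative, not the book's:

```cpp
#include <cstdint>

// Per Intel's CPUID leaf 4 documentation:
//   size = (ways + 1) * (partitions + 1) * (line size + 1) * (sets + 1)
// where ways = EBX[31:22], partitions = EBX[21:12],
// line size = EBX[11:0], and sets = ECX.
uint32_t DecodeCacheSize(uint32_t ebx, uint32_t ecx)
{
    uint32_t ways       = (ebx >> 22) & 0x3FFu;
    uint32_t partitions = (ebx >> 12) & 0x3FFu;
    uint32_t lineSize   =  ebx        & 0xFFFu;
    uint32_t sets       =  ecx;
    return (ways + 1) * (partitions + 1) * (lineSize + 1) * (sets + 1);
}
```

For example, an 8-way cache with 64-byte lines and 64 sets decodes to 8 × 1 × 64 × 64 = 32 KB, a typical L1 data cache.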