Work-efficient parallel prefix — implementation 

As a capstone for this chapter, we'll write an implementation of this algorithm that can operate on arrays of arbitrary size greater than 1,024. Because such arrays span multiple blocks across the grid, we'll have to use the host for synchronization between phases. This means implementing two separate kernels, one each for the up-sweep and down-sweep phases, which will act as the parfor loops in both phases, along with Python functions that will act as the outer for loops driving the up- and down-sweeps.
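Before we turn to the CUDA code, it may help to see the whole structure in plain Python. The sketch below is a CPU stand-in (not the book's actual implementation): each inner loop corresponds to one parallel kernel launch, and the two outer loops are the host-side for loops that re-launch the kernels and provide the synchronization points between iterations.

```python
import numpy as np

def blelloch_scan(x):
    """Exclusive prefix sum (work-efficient / Blelloch scan).

    CPU sketch of the structure described above: the inner loops would
    run as parallel kernels on the GPU; the outer loops run on the host.
    Assumes len(x) is a power of two.
    """
    x = np.array(x, dtype=np.float64)
    n = len(x)
    m = int(np.log2(n))
    # Up-sweep (reduce) phase: one "kernel launch" per k
    for k in range(m):
        stride = 2 ** (k + 1)
        for j in range(0, n, stride):          # parfor on the GPU
            x[j + stride - 1] += x[j + stride // 2 - 1]
    # Down-sweep phase: again one "kernel launch" per k
    x[n - 1] = 0
    for k in reversed(range(m)):
        stride = 2 ** (k + 1)
        for j in range(0, n, stride):          # parfor on the GPU
            t = x[j + stride // 2 - 1]
            x[j + stride // 2 - 1] = x[j + stride - 1]
            x[j + stride - 1] += t
    return x
```

The host loops are the crucial point here: a kernel launch boundary is the only grid-wide barrier available, which is exactly why the GPU version must be split into separately launched kernels rather than a single one.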

Let's begin with the up-sweep kernel. Since we'll be iteratively re-launching this kernel from the host, we'll also need a parameter that indicates the current iteration (k). We'll ...
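As an illustration of that idea (a CPU stand-in, not the book's actual PyCUDA kernel), the per-iteration logic can be sketched like this: the kernel receives k from the host, each "thread" handles one pair of elements at stride 2**(k+1), and the host function re-launches it once per iteration. The names `up_sweep_kernel` and `up_sweep` are ours, chosen for illustration.

```python
import numpy as np

def up_sweep_kernel(x, k, tid):
    """Body of one GPU thread in the up-sweep kernel for iteration k.

    On the GPU, tid would be computed from blockIdx, blockDim and
    threadIdx; here it is passed in explicitly.
    """
    stride = 2 ** (k + 1)
    j = tid * stride
    x[j + stride - 1] += x[j + stride // 2 - 1]

def up_sweep(x):
    """Host-side outer loop: one kernel 'launch' per iteration k,
    with an implicit synchronization between launches."""
    n = len(x)
    for k in range(int(np.log2(n))):
        num_threads = n // 2 ** (k + 1)        # grid shrinks each iteration
        for tid in range(num_threads):         # all threads run in parallel on the GPU
            up_sweep_kernel(x, k, tid)
    return x
```

After the up-sweep completes, the last element of the array holds the sum of all elements, which is the starting point for the down-sweep phase.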
