
## Chapter 5. Data Parallel Programming with Repa

The techniques we’ve seen in the previous chapters are great for parallelizing code that uses ordinary Haskell data structures like lists and `Map`s, but they don’t work as well for data-parallel algorithms over large arrays. That’s because large-scale array computations demand very good raw sequential performance, which we can get only by operating on arrays of unboxed data. We can’t use Strategies to parallelize operations over unboxed arrays, because they need lazy data structures (boxed arrays would be suitable, but not unboxed arrays). Similarly, `Par` doesn’t work well here either, because in `Par` the data is passed in `IVar`s.

In this chapter, we’re going to see how to write efficient numerical array computations in Haskell and run them in parallel. The library we’re going to use is called Repa, which stands for REgular PArallel arrays.[19] The library provides a range of efficient operations for creating arrays and operating on arrays in parallel.

The Repa package is available on Hackage. If you followed the instructions for installing the sample code dependencies earlier, then you should already have it, but if not you can install it with `cabal install`:

`$ cabal install repa`

In this chapter, I’m going to use GHCi a lot to illustrate the behavior of Repa; trying things out in GHCi is a great way to become familiar with the types and operations that Repa provides. Because Repa provides many operations with the same names as `Prelude` functions (e.g., `map`), we usually import `Data.Array.Repa` with a short module alias:

`> import Data.Array.Repa as Repa`

This way, we can refer to Repa’s `map` function as `Repa.map`.

## Arrays, Shapes, and Indices

Everything in Repa revolves around arrays. A computation in Repa typically consists of computing an array in parallel, perhaps using other arrays as inputs. So we’ll start by looking at the type of arrays, how to build them, and how to work with them.

The `Array` type has three parameters:

```haskell
data Array r sh e
```

The `e` parameter is the type of the elements, for example `Double`, `Int`, or `Word8`. The `r` parameter is the representation type, which describes how the array is represented in memory; I’ll come back to this shortly. The `sh` parameter describes the shape of the array; that is, the number of dimensions it has.

Shapes are built out of two type constructors, `Z` and `:.`:

```haskell
data Z = Z
data tail :. head = tail :. head
```

The simplest shape, `Z`, is the shape of an array with no dimensions (i.e., a scalar), which has a single element. If we add a dimension, `Z :. Int`, we get the shape of an array with a single dimension indexed by `Int`, otherwise known as a vector. Adding another dimension gives `Z :. Int :. Int`, the shape of a two-dimensional array, or matrix. New dimensions are added on the right, and the `:.` operator associates left, so when we write `Z :. Int :. Int`, we really mean `(Z :. Int) :. Int`.

The `Z` and `:.` symbols are both type constructors and value constructors, which can be a little confusing at times. For example, the data value `Z :. 3` has type `Z :. Int`. The data value form is used in Repa to mean either “shapes” or “indices.” For example, `Z :. 3` can be either the shape of three-element vectors, or the index of the fourth element of a vector (indices count from zero).

Repa supports only `Int`-typed indices. A few handy type synonyms are provided for the common shape types:

```haskell
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
```

Let’s try a few examples. A simple way to build an array is to use `fromListUnboxed`:

```haskell
fromListUnboxed :: (Shape sh, Unbox a) => sh -> [a] -> Array U sh a
```

The `fromListUnboxed` function takes a shape of type `sh` and a list of elements of type `a`, and builds an array of type `Array U sh a`. The `U` is the representation and stands for Unboxed: this array will contain unboxed elements. Don’t worry about the `Shape` and `Unbox` type classes. They are just there to ensure that we use only the appropriate shape constructors (`Z` and `:.`) and supported element types, respectively.

Let’s build a 10-element vector of `Int` and fill it with the numbers 1…10. We need to pass a shape argument, which will be `Z:.10` for a 10-element vector:

```
> fromListUnboxed (Z :. 10) [1..10]

<interactive>:15:1:
    No instance for (Shape (Z :. head0))
      arising from a use of `fromListUnboxed'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there is a potential instance available:
      instance Shape sh => Shape (sh :. Int)
        -- Defined in `Data.Array.Repa.Index'
    In the expression: fromListUnboxed (Z :. 10) [1 .. 10]
    In an equation for `it': it = fromListUnboxed (Z :. 10) [1 .. 10]
```

Oops! This illustrates something that you will probably encounter a lot when working with Repa: a type error caused by insufficient type information. In this case, the integer `10` in `Z :. 10` is overloaded, so we have to say explicitly that we mean `Int`. There are many ways to give GHC the extra bit of information it needs; one way is to add a type signature to the whole expression, which has type `Array U DIM1 Int`:

```
> fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
```

Similarly, we can make a two-dimensional array, with 3 rows of 5 columns, and fill it with the elements 1 to 15:

```
> fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
AUnboxed ((Z :. 3) :. 5) (fromList [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
```

Conceptually, the array we created is this:

```
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
```

But internally, the array is stored as a single vector (after all, computer memory is one-dimensional). We can see the vector in the result of the call to `fromListUnboxed`; it contains the same elements that we initialized the array with.

The shape of the array is there to tell Repa how to interpret the operations on it. For example, if we ask for the element at index `Z:.2:.1` in an array with shape `Z:.3:.5`, we’ll get the element at position 2 * 5 + 1 in the vector. We can try it using the `!` operator, which extracts an element from an array. The type of `!` is:

```haskell
(!) :: (Shape sh, Source r e) => Array r sh e -> sh -> e
```

Let’s get the element at position `Z:.2:.1` from our example matrix:

```
> let arr = fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
> arr ! (Z:.2:.1)
12
```

The element 12 is therefore 2 rows down and 1 column across. As I mentioned earlier, indices count from zero in Repa.

Internally, Repa is using the function `toIndex` to convert an index to an `Int` offset, given a shape:

```haskell
toIndex :: Shape sh => sh -> sh -> Int
```

For example:

```
> toIndex (Z:.3:.5 :: DIM2) (Z:.2:.1 :: DIM2)
11
```

Because the layout of an array in memory is the same regardless of its shape, we can even change the shape without copying the array:

```
> reshape (Z:.5:.3) arr ! (Z:.2:.1 :: DIM2)
8
```

With the shape `Z:.5:.3`, the index `Z:.2:.1` corresponds to the element at 2 * 3 + 1 = 7, which has value 8.
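The row-major arithmetic behind `toIndex` and `reshape` can be sketched in plain Haskell, without Repa. This is a toy model for two-dimensional shapes only; `Shape2` and `toIndex2` are my own names, not part of Repa's API:

```haskell
-- Toy model of Repa's row-major indexing for two-dimensional arrays.
-- (Hypothetical names; Repa's real toIndex is shape-polymorphic.)

-- A two-dimensional shape or index: (rows, columns) or (row, column).
type Shape2 = (Int, Int)

-- Convert a two-dimensional index into a flat offset, given a shape.
toIndex2 :: Shape2 -> Shape2 -> Int
toIndex2 (_, w) (i, j) = i * w + j

main :: IO ()
main = do
  let vec = [1..15] :: [Int]
  -- Index (2,1) in a 3x5 array: offset 2*5+1 = 11, element 12.
  print (vec !! toIndex2 (3, 5) (2, 1))
  -- The same vector viewed with shape 5x3: offset 2*3+1 = 7, element 8.
  print (vec !! toIndex2 (5, 3) (2, 1))
```

Reshaping is free precisely because only the `toIndex` arithmetic changes; the flat vector of elements is untouched.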

Here are a couple of other operations on shapes that often come in handy:

```haskell
rank :: Shape sh => sh -> Int  -- number of dimensions
size :: Shape sh => sh -> Int  -- number of elements
```

To retrieve the shape of an array, we can use `extent`:

```haskell
extent :: (Shape sh, Source r e) => Array r sh e -> sh
```

For example:

```
> extent arr
(Z :. 3) :. 5
> rank (extent arr)
2
> size (extent arr)
15
```

## Operations on Arrays

We can map a function over an array using Repa’s `map` function:

```haskell
Repa.map :: (Shape sh, Source r a)
         => (a -> b) -> Array r sh a -> Array D sh b
```

We can see from the type that `map` returns an array with the representation `D`. The `D` representation stands for Delayed; this means that the array has not been computed yet. A delayed array is represented by a function from indices to elements.

We can apply `map` to an array, but there’s no way to print out the result:

```
> let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> Repa.map (+1) a

<interactive>:26:1:
    No instance for (Show (Array D DIM1 Int))
      arising from a use of `print'
    Possible fix:
      add an instance declaration for (Show (Array D DIM1 Int))
    In a stmt of an interactive GHCi command: print it
```

As its name suggests, a delayed array is not an array yet. To turn it into an array, we have to call a function that allocates the array and computes the value of each element. The `computeS` function does this for us:

```haskell
computeS :: (Load r1 sh e, Target r2 e) => Array r1 sh e -> Array r2 sh e
```

The argument to `computeS` is an array with a representation that is a member of the `Load` class, whereas its result is an array with a representation that is a member of the `Target` class. The most important instances of these two classes are `D` and `U`, respectively; that is, `computeS` turns a delayed array into a concrete unboxed array.[20]

Applying `computeS` to the result of `map` gives us an unboxed array:

```
> computeS (Repa.map (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,3,4,5,6,7,8,9,10,11])
```

You might be wondering why there is this extra complication—why doesn’t `map` just produce a new array? The answer is that by representing the result of an array operation as a delayed array, a sequence of array operations can be performed without ever building the intermediate arrays; this is an optimization called fusion, and it’s critical to achieving good performance with Repa. For example, if we composed two `map`s together:

```
> computeS (Repa.map (+1) (Repa.map (^2) a)) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,5,10,17,26,37,50,65,82,101])
```

The intermediate array between the two `map`s is not built, and in fact if we compile this rather than running it in GHCi, provided the optimization option `-O` is enabled, it will compile to a single efficient loop over the input array.

Let’s see how it works. The fundamental way to get a delayed array is `fromFunction`:

```haskell
fromFunction :: sh -> (sh -> a) -> Array D sh a
```

The `fromFunction` operation creates a delayed array. It takes the shape of the array and a function that specifies the elements. For example, we can make a delayed array that represents the vector of integers 0 to 9 like this:

```
> let a = fromFunction (Z :. 10) (\(Z:.i) -> i :: Int)
> :t a
a :: Array D (Z :. Int) Int
```

Delayed arrays support indexing, just like manifest arrays:

```
> a ! (Z:.5)
5
```

Indexing a delayed array works by just calling the function that we supplied to `fromFunction` with the given index.

We need to apply `computeS` to make the delayed array into a manifest array:

```
> computeS a :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [0,1,2,3,4,5,6,7,8,9])
```

The `computeS` function creates the array and for each of the indices of the array, it calls the function stored in the delayed array to find the element at that position.

The `map` function, along with many other operations on arrays, can be specified in terms of `fromFunction`. For example, here is a definition of `map`:

```
> let mymap f a = fromFunction (extent a) (\ix -> f (a ! ix))
> :t mymap
mymap
  :: (Shape sh, Source r e) =>
     (e -> a) -> Array r sh e -> Array D sh a
```

It works just like the real `map`:

```
> computeS (mymap (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
```

What happens if we compose two `map`s together? The result would be a delayed array containing a function that indexes into another delayed array. So we’re building up a nested function that defines the array elements, rather than intermediate arrays. Furthermore, Repa is carefully engineered so that at compile time the nested function call is optimized away as far as possible, yielding very efficient code.
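The whole delayed-array idea can be modeled in a few lines of plain Haskell. This is my own toy sketch, not Repa's actual `D` representation (which is richer), but it shows why composing two `map`s builds a composed function rather than an intermediate array:

```haskell
-- A toy model of one-dimensional delayed arrays, to illustrate fusion.
-- (Hypothetical names: fromFunction', mapD, computeD.)
data Delayed a = Delayed Int (Int -> a)   -- extent and index function

fromFunction' :: Int -> (Int -> a) -> Delayed a
fromFunction' = Delayed

-- Mapping just composes the index function; no array is built.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n g) = Delayed n (f . g)

-- Computing materializes the elements: one call per index.
computeD :: Delayed a -> [a]
computeD (Delayed n g) = [ g i | i <- [0 .. n-1] ]

main :: IO ()
main = do
  let a = fromFunction' 10 id                 -- the vector 0..9
  -- Two maps fuse into a single pass over the indices.
  print (computeD (mapD (+1) (mapD (^2) a)))
```

In the real library, GHC then inlines and simplifies the composed index function at compile time, which is where the single efficient loop comes from.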

## Example: Computing Shortest Paths

In Example: Shortest Paths in a Graph, we looked at an implementation of the Floyd-Warshall algorithm for computing the lengths of shortest paths in a sparse weighted directed graph. Here, we’ll investigate how to code up the algorithm over dense graphs, using Repa.

For reference, here is the pseudocode definition of the algorithm:

```haskell
shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k = min (shortestPath g i j (k-1))
                           (shortestPath g i k (k-1) + shortestPath g k j (k-1))
```

We implement this by first computing all the shortest paths for `k == 0`, then `k == 1`, and so on up to the maximum vertex in the graph.
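This stepping scheme can be checked with a plain-list sketch before we move to arrays. The code below is my own inefficient reference version (lists, `!!` indexing, and an arbitrary `inf` value standing in for "no edge"), good only for small graphs and for validating the recurrence:

```haskell
-- Plain-list reference implementation of Floyd-Warshall stepping.
-- inf is assumed larger than any real path length in the test graph.
inf :: Int
inf = 100

-- One step: allow paths through vertex k.
step :: [[Int]] -> Int -> [[Int]]
step g k = [ [ min (g!!i!!j) (g!!i!!k + g!!k!!j) | j <- vs ] | i <- vs ]
  where vs = [0 .. length g - 1]

-- Run the step for k = 0 .. n-1.
shortestPathsList :: [[Int]] -> [[Int]]
shortestPathsList g = foldl step g [0 .. length g - 1]

main :: IO ()
main = print (shortestPathsList [[0,1,inf],[inf,0,1],[1,inf,0]])
```

For the three-vertex cycle above, every pair ends up connected: going around the cycle gives distances 1 and 2 in each row.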

For the dense version, we’re going to use an adjacency matrix; that is, a two-dimensional array indexed by pairs of vertices, where each element is the length of the path between the two vertices. Here is our representation of graphs:

fwdense.hs

```haskell
type Weight = Int
type Graph r = Array r DIM2 Weight
```

The implementation of the shortest paths algorithm is as follows:

```haskell
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = go g0 0
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = g
      | otherwise =
          let g' = computeS (fromFunction (Z:.n:.n) sp)
          in  go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))
```

The number of vertices in the graph, `n`, is found by pattern-matching on the shape of the input graph, which we get by calling `extent`.

We need to loop over the vertices, with `k` taking values from 0 up to `n - 1`. This is done with a local recursive function `go`, which takes the current graph `g` and `k` as arguments. The initial value for `g` is `g0`, the input graph, and the initial value for `k` is 0.

The first case in `go` applies when we have looped over all the vertices, and `k == n`. The result is the current graph, `g`.

The `otherwise` case is the interesting one. We build a new adjacency matrix, `g'`, for this step using `fromFunction`. The shape of the array is `Z:.n:.n`, the same as the input, and the function to compute each element is `sp` (discussed below).

To manifest the new graph, we call `computeS`. Do we have to call `computeS` for each step, or could we wait until the end? If we don’t manifest the graph at each step, then we will be calling a nest of `k` functions every time we index into the current graph, `g`, which is exactly what this dynamic-programming solution seeks to avoid. So we must manifest the graph at each step.

Then `go` is called recursively to continue with the next step, passing the new graph we just computed, `g'`, and the next value of `k`.

The `sp` function computes the value of each element in the new matrix and is a direct translation of the pseudocode: the shortest path between `i` and `j` is the minimum of the current shortest path, and the shortest path that goes from `i` to `k` and then to `j`, all of which we get by indexing into the current graph, `g`.

The code is quite readable and somewhat shorter than the sparse version of the algorithm we saw before. However, there are a couple of subtleties that might not be obvious, but are nevertheless important for making the code run quickly:

• I deliberately used an explicit recursive function, `go`, rather than something like `foldl'`, even though the latter would lead to shorter code. The optimizations in Repa work much better when all the code is visible to the compiler, and calling out to library functions can sometimes hide details from GHC and prevent optimizations. There are no hard and fast rules here; I experimented with both the explicit version and the `foldl'` version, and found the explicit loop faster.
• There are bang-patterns on the arguments to `go`. This is good practice for iterative loops like this one and helps Repa to optimize the loop.

Let’s go ahead and compile the program and try it out on a 500-vertex graph:

```
> ghc fwdense.hs -O2 -fllvm
[1 of 1] Compiling Main             ( fwdense.hs, fwdense.o )
> ./fwdense 500 +RTS -s
31250125000
   1,077,772,040 bytes allocated in the heap
      31,516,280 bytes copied during GC
      10,334,312 bytes maximum residency (171 sample(s))
       2,079,424 bytes maximum slop
              32 MB total memory in use (3 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       472 colls,     0 par    0.01s    0.01s     0.0000s    0.0005s
  Gen  1       171 colls,     0 par    0.03s    0.03s     0.0002s    0.0063s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.46s  (  1.47s elapsed)
  GC      time    0.04s  (  0.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.50s  (  1.50s elapsed)
```

Note that I added a couple of optimization options: `-O2` turns up GHC’s optimizer, and `-fllvm` enables GHC’s LLVM backend, which significantly improves the performance of Repa code; on my machine with this particular example, I see a 40% improvement from `-fllvm`.[21]

### Parallelizing the Program

Now to make the program run in parallel. To compute an array in parallel, Repa provides a variant of the `computeS` operation, called `computeP`:

```haskell
computeP :: (Monad m, Source r2 e, Target r2 e, Load r1 sh e)
         => Array r1 sh e
         -> m (Array r2 sh e)
```

Whereas `computeS` computes an array sequentially, `computeP` uses the available cores to compute the array in parallel. It knows the size of the array, so it can divide the work equally amongst the cores.
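The "divide the work equally" idea is easy to picture. Here is my own illustration of one standard scheme (not Repa's actual scheduler): split `n` elements among `c` workers so that chunk sizes differ by at most one:

```haskell
-- Sketch: divide n elements as evenly as possible among c workers.
-- The first (n `mod` c) chunks get one extra element.
chunkSizes :: Int -> Int -> [Int]
chunkSizes c n = replicate r (q + 1) ++ replicate (c - r) q
  where (q, r) = n `quotRem` c

main :: IO ()
main = print (chunkSizes 4 10)   -- 10 elements over 4 cores
```

Because the array's extent is known up front, this partition can be computed before any element is evaluated, which is what makes dense arrays such a good fit for data parallelism.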

The type is almost the same as `computeS`, except that `computeP` takes place in a monad. It works with any monad, and it doesn’t matter which monad is used because the purpose of the monad is only to ensure that `computeP` operations are performed in sequence and not nested. Hence we need to modify our code so that the `go` function is in a monad, which entails a few small changes. Here is the code:

```haskell
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = runIdentity $ go g0 0
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = return g
      | otherwise = do
          g' <- computeP (fromFunction (Z:.n:.n) sp)
          go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))
```

We need some monad, and the `Identity` monad will do.

Remember to `return` the result, as we're now in a monad.

Instead of `let` to bind `g'`, we use `do` and monadic bind, and we replace `computeS` with `computeP`. There are no changes to the `fromFunction` call or the `sp` function.

To run it in parallel, we’ll need to add the `-threaded` option when compiling. Let’s see how it performs:

```
> ghc -O2 fwdense1.hs -threaded -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( fwdense1.hs, fwdense1.o )
> ./fwdense1 500 +RTS -s
31250125000
...
  Total   time    1.89s  (  1.91s elapsed)
```

There’s some overhead for using `computeP`, which here seems to be about 27%. That’s quite high, but we can recoup it by using more cores. With four cores:

```
> ./fwdense1 500 +RTS -s -N4
31250125000
...
  Total   time    2.15s  (  0.57s elapsed)
```

That equates to a speedup of 2.63 over the sequential version, for almost zero effort. Not bad!

## Folding and Shape-Polymorphism

Folds are an important class of operations over arrays; they are the operations that perform a collective operation over all the elements of an array to produce a single result, such as summing the array or finding its maximum element. For example, the function `sumAllS` calculates the sum of all the elements in an array:

```haskell
sumAllS :: (Num a, Shape sh, Source r a, Unbox a, Elt a)
        => Array r sh a
        -> a
```

For an array of elements of type `a` that supports addition (the `Num` constraint), `sumAllS` produces a single result that is the sum of all the elements:

```
> let arr = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> sumAllS arr
55
```

But sometimes we don’t want to fold over the whole array. There are occasions where we need to fold over just one dimension. For example, in the shortest paths example, suppose we wanted to take the resulting matrix of path lengths and find for each vertex the furthest distance we would have to travel from that vertex to any other vertex in the graph.

Our graph may have some nodes that are not connected, and in that case we represent the distance between them by a special large value called `inf` (the value of `inf` doesn’t matter as long as it is larger than all the path lengths in the graph). For the purposes of finding the maximum distance to other nodes, we’ll ignore nodes that are not reachable and hence have path length `inf`. So the function to compute the maximum of two path lengths is as follows:

```haskell
maxDistance :: Weight -> Weight -> Weight
maxDistance x y
  | x == inf  = y
  | y == inf  = x
  | otherwise = max x y
```

Now we want to fold `maxDistance` over just one dimension of our two-dimensional adjacency matrix. There is a function called `foldS` that does just that; here is its type:

```haskell
foldS :: (Shape sh, Source r a, Elt a, Unbox a)
      => (a -> a -> a)          -- the function to fold
      -> a                      -- the starting value
      -> Array r (sh :. Int) a  -- the input array
      -> Array U sh a           -- the output array
```

• The first argument is the function to fold.
• The second argument is the starting value of type `a`.
• The input array has shape `(sh :. Int)`; that is, an array of some shape `sh` with one more dimension.
• The output array has shape `sh`; that is, one dimension fewer than the input array. For example, if we pass in an array of shape `Z:.Int:.Int`, `sh` is `Z:.Int`. The fold takes place over the inner dimension of the array, which we normally think of as the rows. Each row is reduced to a single value.
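This "reduce the inner dimension" behavior can be modeled with plain lists. Here a matrix is a list of rows, and my own hypothetical `foldRows` stands in for `foldS` (which does the same thing over unboxed arrays):

```haskell
-- Sketch: folding over the inner dimension reduces each row to one value,
-- turning a matrix into a vector.
foldRows :: (a -> a -> a) -> a -> [[a]] -> [a]
foldRows f z = map (foldl f z)

-- The inf sentinel and maxDistance from the text (inf is assumed larger
-- than any path length in this toy example).
inf :: Int
inf = 100

maxDistance :: Int -> Int -> Int
maxDistance x y
  | x == inf  = y
  | y == inf  = x
  | otherwise = max x y

main :: IO ()
main = print (foldRows maxDistance inf [[3, inf, 7], [inf, inf, inf]])
```

The first row reduces to 7 (its largest finite entry); the second row, being all `inf`, reduces to the starting value `inf` itself.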

The fwdense.hs program has a small test graph of six vertices:

```
> extent testGraph
(Z :. 6) :. 6
```

If we use `foldS` to fold `maxDistance` over the matrix of shortest paths, we obtain the maximum distance from each vertex to any other vertex:

```
> foldS maxDistance inf (shortestPaths testGraph)
AUnboxed (Z :. 6) (fromList [20,19,31,18,15,21])
```

And if we fold once more, we’ll find the longest distance between any two nodes (for which a path exists) in the graph:

```
> foldS maxDistance inf (foldS maxDistance inf (shortestPaths testGraph))
AUnboxed Z (fromList [31])
```

Note that the result this time is an array with zero dimensions, otherwise known as a scalar.

A function named `foldP` allows us to fold in parallel:

```haskell
foldP :: (Shape sh, Source r a, Elt a, Unbox a, Monad m)
      => (a -> a -> a)
      -> a
      -> Array r (sh :. Int) a
      -> m (Array U sh a)
```

For the same reasons as `computeP`, `foldP` is performed in an arbitrary monad. The arguments are the same as for `foldS`.

### Caution

The function argument used with `foldP` must be associative. That is, the function `f` must satisfy `f x (f y z) == f (f x y) z`. This is because unlike `foldS`, `foldP` doesn’t necessarily fold the function over the array elements in strict left-to-right order; it folds different parts of the array in parallel and then combines the results from those parts using the folding function.

Note that strictly speaking, although mathematical addition is associative, floating-point addition is not, due to rounding errors. However, we tend to ignore this detail when using `foldP` because a small amount of nondeterminism in the floating point result is normally acceptable.
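Both points can be demonstrated with a small sketch. The hypothetical `chunkedFold` below mimics the shape of a parallel fold: fold each chunk separately, then combine the partial results. For an associative operator (with `z` as its identity) the split doesn't matter; for floating-point addition, it can:

```haskell
-- Sketch of a chunked fold, the shape of what foldP does: fold each
-- chunk, then fold the partial results. Agrees with a sequential left
-- fold only when f is associative (and z is its identity).
chunkedFold :: (a -> a -> a) -> a -> [[a]] -> a
chunkedFold f z chunks = foldl f z (map (foldl f z) chunks)

main :: IO ()
main = do
  -- Integer addition is associative: any chunking gives the same answer.
  print (chunkedFold (+) 0 [[1,2,3],[4,5],[6]] == foldl (+) 0 [1..6])
  -- Floating-point addition is not associative, so different groupings
  -- of the same numbers can produce slightly different results.
  print ((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3 :: Double))
```

The second comparison evaluates to `False` on standard IEEE 754 doubles, which is exactly the small nondeterminism that `foldP` can expose.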

## Example: Image Rotation

Repa is a great tool for coding image manipulation algorithms, which tend to be naturally parallel and involve a lot of data. In this section, we’ll write a program to rotate an image about its center by a specified number of degrees.

For reading and writing image data, Repa provides an interface to the DevIL library, which is a cross-platform C library for image manipulation. DevIL supports reading and writing various common image formats, including PNG and JPG. The library is wrapped by the Haskell package `repa-devil`, which provides a convenient Haskell API to DevIL. The two operations we’ll be using are `readImage` and `writeImage`:

```haskell
readImage  :: FilePath -> IL Image
writeImage :: FilePath -> Image -> IL ()
```

The `Image` type defines various in-memory image representations:

```haskell
data Image
  = RGBA (Array F DIM3 Word8)
  | RGB  (Array F DIM3 Word8)
  | BGRA (Array F DIM3 Word8)
  | BGR  (Array F DIM3 Word8)
  | Grey (Array F DIM2 Word8)
```

A color image is represented as a three-dimensional array. The first two dimensions are the Y and X axes, and the last dimension contains the three color channels and optionally an alpha channel. The first four constructors of `Image` correspond to different orderings of the color channels and the presence or not of an alpha channel. The last option, `Grey`, is a grayscale image with one byte per pixel.

Which one of these is returned by `readImage` depends on the type of image file being read. For example, a color JPEG image returns data in `RGB` format, but a PNG image returns in `RGBA` format.

You may have noticed one unfamiliar aspect to these array types: the `F` representation type. This indicates that the array data is held in foreign memory; that is, it was allocated by C code. Apart from being allocated by C rather than Haskell, the `F` representation is identical to `U`.

Note that `readImage` and `writeImage` are in the `IL` monad. The purpose of the `IL` monad is to ensure that the DevIL library is initialized properly. This is done by `runIL`:

```haskell
runIL :: IL a -> IO a
```

It’s perfectly fine to have multiple calls to `runIL`; the library will be initialized only once.

Our program will take three arguments: the number of degrees to rotate the image by, the input filename, and the output filename, respectively:

```haskell
main :: IO ()
main = do
  [n, f1, f2] <- getArgs
  runIL $ do
    (RGB v) <- readImage f1
    rotated <- computeP $ rotate (read n) v :: IL (Array F DIM3 Word8)
    writeImage f2 (RGB rotated)
```

Read the image data from the file `f1` (the second command-line argument).

The function `rotate`, which we will define shortly, returns a delayed array representing the rotated image. We call `computeP` here to calculate the new array in parallel. In the earlier examples, we used `computeP` to produce arrays with `U` representation, but here we’re producing an array with `F` representation. This is possible because `computeP` is overloaded on the desired output representation; this is the purpose of the `Target` type class.

Finally, write the new image to the file `f2`.

Next we’ll write the function `rotate`, which actually calculates the rotated image data. First, we have a decision to make: what should the size of the rotated image be? We could produce a smaller image than the input, discarding any pixels that fall outside the boundaries after rotation, or we could enlarge the image to contain the whole rotated result, filling the empty areas with something else (e.g., black). I’ll opt, somewhat arbitrarily, to keep the output image size the same as the input and fill in the empty areas with black. Please feel free to modify the program to do something more sensible.

```haskell
rotate :: Double -> Array F DIM3 Word8 -> Array D DIM3 Word8
rotate deg g = fromFunction (Z :. y :. x :. k) f
  where
    sh@(Z :. y :. x :. k) = extent g

    !theta = pi/180 * deg

    !st = sin theta
    !ct = cos theta

    !cy = fromIntegral y / 2 :: Double
    !cx = fromIntegral x / 2 :: Double

    f (Z :. i :. j :. k)
      | inShape sh old = g ! old
      | otherwise      = 0
      where
        fi = fromIntegral i - cy
        fj = fromIntegral j - cx

        i' = round (st * fj + ct * fi + cy)
        j' = round (ct * fj - st * fi + cx)

        old = Z :. i' :. j' :. k
```

The formula to rotate a point (x,y) by an angle θ about the origin is given by:

x′ = x cos θ − y sin θ
y′ = x sin θ + y cos θ

However, we want to rotate our image about its center, while the origin of the coordinate system is the upper-left corner. Hence we need to adjust the points to be relative to the center of the image before rotating and adjust them back afterward.
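The adjust-rotate-adjust-back recipe can be checked in isolation with a small sketch. The name `rotateAbout` is my own; the arithmetic mirrors what the `rotate` function does with array indices:

```haskell
-- Sketch: rotate a point by theta about an arbitrary center, by shifting
-- to center-relative coordinates, rotating about the origin, and
-- shifting back.
rotateAbout :: Double -> (Double, Double) -> (Double, Double) -> (Double, Double)
rotateAbout theta (cx, cy) (x, y) = (x' + cx, y' + cy)
  where
    fx = x - cx                        -- coordinates relative to center
    fy = y - cy
    x' = fx * cos theta - fy * sin theta
    y' = fx * sin theta + fy * cos theta

main :: IO ()
main = do
  -- Rotating (1,0) by 90 degrees about the origin should give (0,1),
  -- up to floating-point rounding.
  let (x, y) = rotateAbout (pi/2) (0, 0) (1, 0)
  print (round x :: Int, round y :: Int)
```

Note that `rotate` in the program actually runs this mapping in reverse: for each destination pixel it computes the source position, so that every output pixel gets a value.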

We’re creating a delayed array, represented by the function `f`. The dimensions of the array are the same as the input array, which we get by calling `extent` just below.

Convert the angle by which to rotate the image from degrees to radians.

Because we’ll need the values of `sin theta` and `cos theta` twice each, we define them once here.

`cy` and `cx` are the y- and x-coordinates, respectively, of the center of the image.

The function `f`, which gives the value of the new image at position `i`, `j`, `k` (where `k` here is between 0 and 2, corresponding to the RGB color channels).

First, we need to check whether the `old` pixel (the pixel we are rotating into this position) is within the bounds of the original image. The function `inShape` does this check for us:

```haskell
inShape :: Shape sh => sh -> sh -> Bool
```

If the `old` pixel is within the image, then we return the value at that position in the old image.

If the rotated position in the old image is out of bounds, then we return zero, giving a black pixel at this position in the new image.

`fi` and `fj` are the y and x values of this point relative to the center of the image, respectively.

`i'` and `j'` are the coordinates of the pixel in the old image that will be rotated to this position in the new image, computed using the rotation formulas via `st` and `ct`.

Finally, `old` is the index of the pixel in the old image.

To see the program working, we first need an image to rotate: Figure 5-1.

Running the program like so results in the straightened image shown in Figure 5-2:

`$ ./rotateimage 4 wonky.jpg straight.jpg`

Let’s check the performance of the program:

```
$ rm straight.jpg
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s
...
  Total   time    0.69s  (  0.69s elapsed)
```

And see how much we can gain by running it in parallel, on four cores:

```
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s -N4
...
  Total   time    0.76s  (  0.24s elapsed)
```

The result is a speedup of 2.88. However, this program spends 0.05s of its time reading and writing the image file (measured by modifying the program to omit the rotation step), and if we factor this into the results, we obtain a speedup for the parallel portion of the program of 3.39.

## Summary

Repa provides a convenient framework for describing array operations and has some significant benefits:

• Intermediate arrays are automatically eliminated when array operations are composed (fusion).
• Operations like `computeP` and `foldP` automatically parallelize using the available cores.

There are a couple of gotchas to bear in mind:

• Repa relies heavily on GHC’s optimizer and is quite sensitive to things like strictness annotations and `INLINE` pragmas. A good rule of thumb is to use both of these liberally. You might also need to use simpler code and fewer library functions so that GHC can see all of your code and optimize it.
• Don’t forget to add the `-fllvm` option if your computer supports it.

There’s much more to Repa that we haven’t covered. For example, Repa has support for stencil convolutions: a common class of image-processing algorithms in which a transformation on each pixel is calculated as some function of the surrounding pixels. For certain kinds of stencil functions that are known at compile time, Repa can generate specialized code that runs extremely fast.

To learn more, take a look at the full Repa documentation on Hackage.

[19] Note that we’re using Repa version 3.2 here; 3.0 had a somewhat different API.

[20] There are other array representations that aren’t covered in this chapter; for more details, see the Repa documentation.

[21] You might not have LLVM installed on your computer, in which case the `-fllvm` option will not work. Don’t worry: Repa works perfectly well without it. The code will just be slower.
