OpenGL Insights by Christophe Riccio, Patrick Cozzi

Performance Tuning for
Tile-Based Architectures
Bruce Merry
23.1 Introduction
The OpenGL and OpenGL ES specifications describe a virtual pipeline in which
triangles are processed in order: the vertices of a triangle are transformed, the triangle
is set up and rasterized to produce fragments, the fragments are shaded and then
written to the framebuffer. Once this has been done, the next triangle is processed,
and so on. However, this is not the most efficient way for a GPU to work; GPUs will
usually reorder and parallelize things under the hood for better performance.
In this chapter, we will examine tile-based rendering, a particular way to arrange
a graphics pipeline that is used in several popular mobile GPUs. We will look at
what tile-based rendering is and why it is used and then look at what needs to be
done differently to achieve optimal performance. I assume that the reader already
has experience with optimizing OpenGL applications and is familiar with the standard
techniques, such as reducing state changes, reducing the number of draw calls,
reducing shader complexity, and using texture compression, and is looking for advice that
is specific to tile-based GPUs.
Keep in mind that every GPU, every driver, and every application is different and
will have different performance characteristics [Qua 10]. Ultimately, performance
tuning is a process of profiling and experimentation. Thus, this chapter contains
very few hard-and-fast rules but instead tries to illustrate how to estimate the costs
associated with different approaches.
This chapter is about maximizing performance, but since tile-based GPUs are
currently popular in mobile devices, we will also briefly mention power consumption.
Many desktop applications will simply render as many frames per second as possible,
always consuming 100% of the available processing power. Deliberately throttling
the frame rate to a more modest level and thus consuming less power can significantly
extend battery life while having relatively little impact on user experience. Of course,
this does not mean that one should stop optimizing after achieving the target frame
rate: further optimizations will then allow the system to spend more time idle and
hence improve power consumption.
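To make the idle-time argument concrete, here is a minimal sketch in C. The function names and the microsecond units are assumptions made for illustration and do not come from any OpenGL ES or EGL API; the sketch only computes how much of each frame period is left idle once the frame rate is capped.

```c
#include <assert.h>

/* Hypothetical helper: length of one frame period, in microseconds,
   for a given target frame rate (integer division, rounds down). */
unsigned frame_budget_us(unsigned target_fps)
{
    assert(target_fps > 0);
    return 1000000u / target_fps;
}

/* Hypothetical helper: time per frame the system can spend idle when
   rendering takes render_us microseconds and the frame rate is capped
   at target_fps. Returns 0 if the frame does not fit in the budget. */
unsigned idle_time_us(unsigned target_fps, unsigned render_us)
{
    unsigned budget = frame_budget_us(target_fps);
    return render_us < budget ? budget - render_us : 0u;
}
```

With a 30 fps cap, the budget is 33,333 µs per frame. A frame that takes 20,000 µs to render leaves 13,333 µs idle; optimizing it down to 10,000 µs leaves 23,333 µs idle, even though the displayed frame rate does not change.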
The main focus of this chapter will be on OpenGL ES since that is the primary
market for tile-based GPUs, but occasionally I will touch on desktop OpenGL fea-
tures and how they might perform.
23.2 Background
While performance is the main goal for desktop GPUs, mobile GPUs must balance
performance against power consumption, i.e., battery life. One of the biggest con-
sumers of power in a device is memory bandwidth: computations are relatively cheap,
but the further data has to be moved, the more power it takes.
The OpenGL virtual pipeline requires a large amount of bandwidth. For a fairly
typical use-case, each pixel will require a read from the depth/stencil buffer, a write
back to the depth/stencil buffer, and a write to the color buffer, say 12 bytes of traffic,
assuming no overdraw, no blending, no multipass algorithms, and no multisampling.
With all the bells and whistles, one can easily generate over 100 bytes of memory
traffic for each displayed pixel. Since at most 4 bytes of data are needed per displayed
pixel, this is an excessive use of bandwidth and hence power. In reality, desktop GPUs
use compression techniques to reduce the bandwidth, but it is still significant.
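The 12-byte figure can be reproduced with a short calculation. The sketch below is in C and assumes a 32-bit packed depth/stencil format and a 32-bit color format; these sizes and the function names are illustrative assumptions, and the calculation ignores the compression that real GPUs apply.

```c
/* Assumed buffer formats: 4 bytes of packed depth/stencil and
   4 bytes of color per pixel. */
enum { DEPTH_STENCIL_BYTES = 4, COLOR_BYTES = 4 };

/* Framebuffer traffic per pixel in the simple case: one depth/stencil
   read, one depth/stencil write, and one color write. */
unsigned bytes_per_pixel_simple(void)
{
    return DEPTH_STENCIL_BYTES   /* depth/stencil read  */
         + DEPTH_STENCIL_BYTES   /* depth/stencil write */
         + COLOR_BYTES;          /* color write         */
}

/* Framebuffer traffic in bytes per second for a width x height target
   drawn fps times per second with the given traffic per pixel. */
unsigned long long framebuffer_bytes_per_second(unsigned width,
                                                unsigned height,
                                                unsigned fps,
                                                unsigned bytes_per_pixel)
{
    return (unsigned long long)width * height * fps * bytes_per_pixel;
}
```

For example, an 800 × 480 display updated at 60 frames per second would generate 276,480,000 bytes per second at 12 bytes per pixel, and more than eight times that at 100 bytes per pixel.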
To reduce this enormous bandwidth demand, many mobile GPUs use tile-based
rendering. At the most basic level, these GPUs move the framebuffer, including the
depth buffer, multisample buffers, etc., out of main memory and into high-speed
on-chip memory. Since this memory is on-chip, and close to where the computa-
tions occur, far less power is required to access it. If it were possible to place a large
framebuffer in on-chip memory, that would be the end of the story; but unfortu-
nately, that would take far too much silicon. The size of the on-chip framebuffer, or
tile buffer, varies between GPUs but can be as small as 16 × 16 pixels.
This poses some new challenges: how can a high-resolution image be produced
using such a small tile buffer? The solution is to break up the OpenGL framebuffer
into 16 × 16 tiles (hence the name “tile-based rendering”) and render one at a time.
For each tile, all the primitives that affect it are rendered into the tile buffer, and once
the tile is complete, it is copied back to the more power-hungry main memory, as
shown in Figure 23.1. The bandwidth advantage comes from only having to write
back a minimum set of results: no depth/stencil values, no overdrawn pixels, and no
multisample buffer data. Additionally, depth/stencil testing and blending are done
entirely on-chip.
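As a rough illustration of how a framebuffer maps onto tiles, the following C sketch computes the tile grid and the range of tiles that a primitive's bounding box overlaps. It assumes 16 × 16 tiles and a conservative bounding-box test; the names are hypothetical, and real GPUs use their own implementation-specific binning schemes.

```c
#include <assert.h>

#define TILE_SIZE 16u  /* assumed tile edge in pixels */

/* Number of tiles needed to cover `pixels` along one axis, rounding up
   so that a partially covered tile at the edge is still counted. */
unsigned tiles_along(unsigned pixels)
{
    return (pixels + TILE_SIZE - 1u) / TILE_SIZE;
}

/* Inclusive range of tile coordinates. */
typedef struct {
    unsigned x0, y0, x1, y1;
} TileRange;

/* Tiles overlapped by an inclusive pixel bounding box. The box must be
   non-empty and start inside the framebuffer; its far corner is clamped
   to the framebuffer edge. */
TileRange tiles_for_bbox(unsigned min_x, unsigned min_y,
                         unsigned max_x, unsigned max_y,
                         unsigned fb_width, unsigned fb_height)
{
    TileRange r;
    assert(min_x <= max_x && min_y <= max_y);
    assert(min_x < fb_width && min_y < fb_height);
    if (max_x >= fb_width)  max_x = fb_width - 1u;
    if (max_y >= fb_height) max_y = fb_height - 1u;
    r.x0 = min_x / TILE_SIZE;
    r.y0 = min_y / TILE_SIZE;
    r.x1 = max_x / TILE_SIZE;
    r.y1 = max_y / TILE_SIZE;
    return r;
}
```

An 800 × 480 framebuffer is covered by 50 × 30 tiles under these assumptions. A primitive whose bounding box spans pixels (10, 10) to (40, 20) overlaps tile columns 0 through 2 and tile rows 0 through 1, so it would be rendered once for each of those six tiles.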