Visualizing Data

By Ben Fry
Book Price: $39.99 USD
£24.99 GBP
PDF Price: $31.99


Table of Contents

Chapter 1: The Seven Stages of Visualizing Data
The greatest value of a picture is when it forces us to notice what we never expected to see.
—John Tukey
What do the paths that millions of visitors take through a web site look like? How do the 3.1 billion A, C, G, and T letters of the human genome compare to those of the chimp or the mouse? Out of a few hundred thousand files on your computer's hard disk, which ones are taking up the most space, and how often do you use them? By applying methods from the fields of computer science, statistics, data mining, graphic design, and visualization, we can begin to answer these questions in a meaningful way that also makes the answers accessible to others.
All of the previous questions involve a large quantity of data, which makes it extremely difficult to gain a "big picture" understanding of its meaning. The problem is further compounded by the data's continually changing nature, which can result from new information being added or older information continuously being refined. This deluge of data necessitates new software-based tools, and its complexity requires extra consideration. Whenever we analyze data, our goal is to highlight its features in order of their importance, reveal patterns, and simultaneously show features that exist across multiple dimensions.
This book shows you how to make use of data as a resource that you might otherwise never tap. You'll learn basic visualization principles, how to choose the right kind of display for your purposes, and how to provide interactive features that will bring users to your site over and over again. You'll also learn to program in Processing, a simple but powerful environment that lets you quickly carry out the techniques in this book. You'll find Processing a good basis for designing interfaces around large data sets, but even if you move to other visualization tools, the ways of thinking presented here will serve you as long as human beings continue to process information the same way they've always done.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Why Data Display Requires Planning
Each set of data has particular display needs, and the purpose for which you're using the data set has just as much of an effect on those needs as the data itself. There are dozens of quick tools for developing graphics in a cookie-cutter fashion in office programs, on the Web, and elsewhere, but complex data sets used for specialized applications require unique treatment. Throughout this book, we'll discuss how the characteristics of a data set help determine what kind of visualization you'll use.
When you hear the term "information overload," you probably know exactly what it means because it's something you deal with daily. In his book Information Anxiety (Doubleday), Richard Saul Wurman describes how the New York Times on an average Sunday contains more information than a Renaissance-era person had access to in an entire lifetime.
But this is an exciting time. For $300, you can purchase a commodity PC that has thousands of times more computing power than the first computers used to tabulate the U.S. Census. The capability of modern machines is astounding. Performing sophisticated data analysis no longer requires a research laboratory, just a cheap machine and some code. Complex data sets can be accessed, explored, and analyzed by the public in a way that simply was not possible in the past.
The past 10 years have also brought about significant changes in the graphic capabilities of average machines. Driven by the gaming industry, high-end 2D and 3D graphics hardware no longer requires dedicated machines from specific vendors, but can instead be purchased as a $100 add-on card and is standard equipment for any machine costing $700 or more. When not used for gaming, these cards can render extremely sophisticated models with thousands of shapes, and can do so quickly enough to provide smooth, interactive animation. And these prices will only decrease—within a few years' time, accelerated graphics will be standard equipment on the aforementioned commodity PC.
An Example
To illustrate the seven steps listed in the previous section, and how they contribute to effective information visualization, let's look at how the process can be applied to understanding a simple data set. In this case, we'll take the zip code numbering system that the U.S. Postal Service uses. The application is not particularly advanced, but it provides a skeleton for how the process works. (A full implementation of the project appears elsewhere in the book.)
All data problems begin with a question and end with a narrative construct that provides a clear answer. The Zipdecode project (described further elsewhere in the book) was developed out of a personal interest in the relationship of the zip code numbering system to geographic areas. Living in Boston, I knew that numbers starting with a zero denoted places on the East Coast. Having spent time in San Francisco, I knew the initial numbers for the West Coast were all nines. I grew up in Michigan, where all our codes were four-prefixed. But what sort of area does the second digit specify? Or the third?
The finished application was initially constructed in a few hours as a quick way to take what might be considered a boring data set (a long list of zip codes, towns, and their latitudes and longitudes) and create something engaging for a web audience that explained how the codes related to their geography.

Acquire

The acquisition step involves obtaining the data. Like many of the other steps, this can be either extremely complicated (e.g., trying to glean useful data from a large system) or very simple (e.g., reading a readily available text file).
A copy of the zip code listing can be found on the U.S. Census Bureau web site, as it is frequently used for geographic coding of statistical data. The listing is a freely available file with approximately 42,000 lines, one for each of the codes, a tiny portion of which is shown in the following figure.
Figure: Zip codes in the format provided by the U.S. Census Bureau
Iteration and Combination
The figure below shows the stages in order and demonstrates how later decisions commonly reflect on earlier stages. Each step of the process is inextricably linked because of how the steps affect one another. In the Zipdecode application, for instance:
  • The need for a compact representation on the screen led me to refilter the data to include only the contiguous 48 states.
  • The representation step affected acquisition because after I developed the application I modified it so it could show data that was downloaded over a slow Internet connection to the browser. My change to the structure of the data allows the points to appear slowly, as they are first read from the data file, employing the data itself as a "progress bar."
  • Interaction by typing successive numbers meant that the colors had to be modified in the visual refinement step to show a slow transition as points in the display are added or removed. This helps the user maintain context by preventing the updates on-screen from being too jarring.
Figure: Interactions between the seven stages
The connections between the steps in the process illustrate the importance of the individual or team in addressing the project as a whole. This runs counter to the common fondness for assembly-line style projects, where programmers handle the technical portions, such as acquiring and parsing data, and visual designers are left to choose colors and typefaces. At the intersection of these fields is a more interesting set of properties that demonstrates their strength in combination.
When acquiring data, consider how it can change, whether sporadically (such as once a month) or continuously. This expands the notion of graphic design that's traditionally focused on solving a specific problem for a specific data set, and instead considers the meta-problem of how to handle a certain kind of data that might be updated in the future.
Principles
I'll finish this general introduction to visualization by laying out some ways of thinking about data and its representation that have served me well over many years and many diverse projects. They may seem abstract at first, or of minor importance to the job you're facing, but I urge you to return and reread them as you practice visualization; they just may help you in later tasks.
A visualization should convey the unique properties of the data set it represents. This book is not concerned with providing a handful of ready-made "visualizations" that can be plugged into any data set. Ready-made visualizations can help produce a quick view of your data set, but they're inflexible commodity items that can be implemented in packaged software. Any bar chart or scatter plot made with Excel will look like a bar chart or scatter plot made with Excel. Packaged solutions can provide only packaged answers, like a pull-string toy that is limited to a handful of canned phrases, such as "Sales show a slight increase in each of the last five years!" Every problem is unique, so capitalize on that uniqueness to solve the problem.
Chapters in this book are divided by types of data, rather than types of display. In other words, we're not saying, "Here's how to make a bar graph," but "Here are several ways to show a correlation." This gives you a more powerful way to think about maximizing what can be said about the data set in question.
I'm often asked for a library of tools that will automatically make attractive representations of any given data set. But if each data set is different, the point of visualization is to expose that fascinating aspect of the data and make it self-evident. Although readily available representation toolkits are useful starting points, they must be customized during an in-depth study of the task.
Data is often stored in a generic format. For instance, databases used for annotation of genomic data might consist of enormous lists of start and stop positions, but those lists vary in importance depending on the situation in which they're being used. We don't view books as long abstract sequences of words, yet when it comes to information, we're often so taken with the enormity of the information and the low-level abstractions used to store it that the narrative is lost. Unless you stop thinking about databases, everything looks like a table—millions of rows and columns to be stored, queried, and viewed.
Onward
In this chapter, we covered the process for attacking the common modern problems of having too much data and having data that changes. In the next chapter, we'll discuss Processing, the software tool used to handle data sets in this book.
Chapter 2: Getting Started with Processing
The Processing project began in the spring of 2001 and was first used at a workshop in Japan that August. Originally built as a domain-specific extension to Java targeted at artists and designers, Processing has evolved into a full-blown design and prototyping tool used for large-scale installation work, motion graphics, and complex data visualization. Processing is a simple programming environment that was created to make it easier to develop visually oriented applications with an emphasis on animation and to provide users with instant feedback through interaction. As its capabilities have expanded over the past six years, Processing has come to be used for more advanced production-level work in addition to its sketching role.
Processing is based on Java, but because program elements in Processing are fairly simple, you can learn to use it from this book even if you don't know any Java. If you're familiar with Java, it's best to forget that Processing has anything to do with it for a while, at least until you get the hang of how the API works. We'll cover how to integrate Java and Processing toward the end of the book.
The latest version of Processing can be downloaded at:
An important goal for the project was to make this type of programming accessible to a wider audience. For this reason, Processing is free to download, free to use, and open source. Even so, projects developed using the Processing environment and core libraries can be used for any purpose. This model is identical to that of GCC, the GNU Compiler Collection. GCC and its associated libraries (e.g., libc) are open source under the GNU General Public License (GPL), which stipulates that changes to the code must be made available. However, programs created with GCC (examples too numerous to mention) are not themselves required to be open source.
Processing consists of:
  • The Processing Development Environment (PDE). This is the software that runs when you double-click the Processing icon. The PDE is an Integrated Development Environment with a minimalist set of features designed as a simple introduction to programming or for testing one-off ideas.
Additional content appearing in this section has been removed.
Sketching with Processing
A Processing program is called a sketch. The idea is to make Java-style programming feel more like scripting, and adopt the process of scripting to quickly write code. Sketches are stored in the sketchbook, a folder that's used as the default location for saving all of your projects. When you run Processing, the sketch last used will automatically open. If this is the first time Processing is used (or if the sketch is no longer available), a new sketch will open.
Sketches that are stored in the sketchbook can be accessed from File → Sketchbook. Alternatively, File → Open . . . can be used to open a sketch from elsewhere on the system.
Advanced programmers need not use the PDE and may instead use its libraries with the Java environment of choice. (This is covered toward the end of the book.) However, if you're just getting started, it's recommended that you use the PDE for your first few projects to gain familiarity with the way things are done. Although Processing is based on Java, it was never meant to be a Java IDE with training wheels. To better address our target audience, its conceptual model (how programs work, how interfaces are built, and how files are handled) is somewhat different from Java's.
Programming languages are often introduced with a simple program that prints "Hello World" to the console. The Processing equivalent is simply to draw a line:
line(15, 25, 70, 90);
Enter this example and press the Run button, which is an icon that looks like the Play button on any audio or video device. The result will appear in a new window, with a gray background and a black line from coordinate (15, 25) to (70, 90). The (0, 0) coordinate is the upper-lefthand corner of the display window. Building on this program to change the size of the display window and set the background color, type in the code from the following example.
Example: Simple sketch
size(400, 400);
background(192, 64, 0);
stroke(255);
line(150, 25, 270, 350);
This version sets the window size to 400 × 400 pixels, sets the background to an orange-red, and draws the line in white, by setting the stroke color to 255. By default, colors are specified in the range 0 to 255. Other variations of the parameters to the
Additional content appearing in this section has been removed.
Exporting and Distributing Your Work
One of the most significant features of the Processing environment is its ability to bundle your sketch into an applet or application with just one click. Select File → Export to package your current sketch as an applet. This will create a folder named applet inside your sketch folder. Opening the index.html file inside that folder will open your sketch in a browser. The applet folder can be copied to a web site intact and will be viewable by users who have Java installed on their systems. Similarly, you can use File → Export Application to bundle your sketch as an application for Windows, Mac OS X, and Linux.
The applet and application folders are overwritten whenever you export—make a copy or remove them from the sketch folder before making changes to the index.html file or the contents of the folder.
More about the export features can be found in the reference; see http://processing.org/reference/environment/export.html.
If you don't want to distribute the actual project, you might want to create images of its output instead. Images are saved with the saveFrame( ) function. Adding saveFrame( ) at the end of draw( ) will produce a numbered sequence of TIFF-format images of the program's output, named screen-0001.tif, screen-0002.tif, and so on. A new file will be saved each time draw( ) runs. Watch out, because this can quickly fill your sketch folder with hundreds of files. You can also specify your own name and file type for the file to be saved with a command like:
saveFrame("output.png");
To do the same for a numbered sequence, use #s (hash marks) where the numbers should be placed:
saveFrame("output-####.png");
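As a rough illustration of the naming scheme, the following plain Java sketch (not Processing's own code) expands a run of hash marks into a zero-padded frame number. The FrameName class and its expand method are hypothetical names introduced here, and Processing's actual handling of the pattern may differ in its details.

```java
class FrameName {
    // Hypothetical helper: replaces the first run of '#' characters in the
    // pattern with the frame number, zero-padded to the length of that run.
    static String expand(String pattern, int frame) {
        int first = pattern.indexOf('#');
        if (first < 0) {
            return pattern; // no hash marks, so the name is used unchanged
        }
        int last = first;
        while (last < pattern.length() && pattern.charAt(last) == '#') {
            last++;
        }
        String digits = String.format("%0" + (last - first) + "d", frame);
        return pattern.substring(0, first) + digits + pattern.substring(last);
    }

    public static void main(String[] args) {
        // Print the file names for the first three frames.
        for (int frame = 1; frame <= 3; frame++) {
            System.out.println(expand("output-####.png", frame));
        }
    }
}
```

With four hash marks, frame 1 becomes output-0001.png and frame 12 becomes output-0012.png, which keeps the files in order when sorted by name.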
For high-quality output, you can write geometry to PDF files instead of the screen, as described later in this chapter.
Examples and Reference
While many programmers learn to code in school, others teach themselves. Learning on your own involves looking at lots of other code: running, altering, breaking, and enhancing it until you can reshape it into something new. With this learning model in mind, the Processing software download includes dozens of examples that demonstrate different features of the environment and API.
The examples can be accessed from the File → Examples menu. They're grouped into categories based on their functions (such as Motion, Typography, and Image) or the libraries they use (such as PDF, Network, and Video).
Find an interesting topic in the list and try an example. You'll see commands that are familiar, such as stroke( ), line( ), and background( ), as well as others that have not yet been covered. To see how a function works, select its name, and then right-click and choose Find in Reference from the pop-up menu (Find in Reference can also be found beneath the Help menu). That will open the reference for that function in your default web browser.
In addition to a description of the function's syntax, each reference page includes an example that uses the function. The reference examples are much shorter (usually four or five lines apiece) and easier to follow than the longer code examples.
The size( ) command also sets the global variables width and height. For objects whose size is dependent on the screen, always use the width and height variables instead of a number (this prevents problems when the size( ) line is altered):
size(400, 400);

// The wrong way to specify the middle of the screen
ellipse(200, 200, 50, 50);

// Always the middle, no matter how the size(  ) line changes
ellipse(width/2, height/2, 50, 50);
In the earlier examples, the size( ) command specified only a width and height for the new window. An optional parameter to the size( ) method specifies how graphics are rendered. A renderer handles how the Processing API is implemented for a particular output method (whether the screen, or a screen driven by a high-end graphics card, or a PDF file). Several renderers are included with Processing, and each has a unique function. At the risk of getting too far into the specifics, here are examples of how to specify them with the
Additional content appearing in this section has been removed.
Functions
The steps of the process outlined in the first chapter are commonly associated with specific functions in the Processing API. For instance:
  • Acquire: loadStrings( ), loadBytes( )
  • Parse: split( )
  • Filter: for( ), if (item[i].startsWith( ))
  • Mine: min( ), max( ), abs( )
  • Represent: map( ), beginShape( ), endShape( )
  • Refine: fill( ), strokeWeight( ), smooth( )
  • Interact: mouseMoved( ), mouseDragged( ), keyPressed( )
This is not an exhaustive list, but simply another way to frame the stages of visualization for those more familiar with code.
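To make the pairing of stages and functions more concrete, here is a minimal sketch of the first five stages in plain Java rather than Processing, so that it runs on its own. The StagesSketch class, its method names, and the three data lines are made up for illustration; rescale( ) is a simple linear rescaling written in the spirit of map( ), not a copy of the Processing API.

```java
import java.util.ArrayList;
import java.util.List;

class StagesSketch {
    // Parse: split one tab-separated line into its fields.
    static String[] parse(String line) {
        return line.split("\t");
    }

    // Filter: keep only the rows whose first field starts with the prefix.
    static List<String[]> filter(List<String[]> rows, String prefix) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (row[0].startsWith(prefix)) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Represent: linearly rescale a value from [start1, stop1] to [start2, stop2].
    static float rescale(float value, float start1, float stop1,
                         float start2, float stop2) {
        return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1));
    }

    public static void main(String[] args) {
        // Acquire: made-up lines stand in for the contents of a loaded file.
        String[] lines = { "a1\t10", "a2\t30", "b1\t20" };

        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(parse(line));
        }
        List<String[]> kept = filter(rows, "a");

        // Mine: find the smallest and largest values among the kept rows.
        float min = Float.MAX_VALUE;
        float max = -Float.MAX_VALUE;
        for (String[] row : kept) {
            float value = Float.parseFloat(row[1]);
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        // Represent: place each value on a hypothetical 0 to 100 pixel axis.
        for (String[] row : kept) {
            float x = rescale(Float.parseFloat(row[1]), min, max, 0, 100);
            System.out.println(row[0] + " -> " + x);
        }
    }
}
```

The refine and interact stages are left out because they depend on drawing and input handling, which only make sense inside a running sketch.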
A library is a collection of code in a specified format that makes it easy to use within Processing. Libraries have been important to the growth of the project because they let developers make new features accessible to users without making them part of the core Processing API.
Several core libraries come with Processing. These can be seen in the Libraries section of the online reference (also available from the Help menu from within the PDE); see http://processing.org/reference/libraries.
One example is the XML import library. This is an extremely minimal XML parser (based on the open source project NanoXML) with a small download footprint (approximately 30KB) that makes it ideal for online use.
To use the XML library in a project, choose Sketch → Import Library → xml. This will add the following line to the top of the sketch:
import processing.xml.*;
Java programmers will recognize the import command. In Processing, this line also determines what code is packaged with a sketch when it is exported as an applet or application.
Now that the XML library is imported, you can issue commands from it. For instance, the following line loads an XML file named sites.xml into a variable named xml:
XMLElement xml = new XMLElement(this, "sites.xml");
The xml variable can now be manipulated as necessary to read the contents. The full example can be seen in the reference for its class, XMLElement
Additional content appearing in this section has been removed.
Sketching and Scripting
Processing sketches are made up of one or more tabs, with each tab representing a piece of code. The environment is designed around projects that are a few pages of code, and often three to five tabs in total. This covers a significant number of projects developed to test and prototype ideas, often before embedding them into a larger project or building a more robust application for broader deployment.
This small-scale development style is useful for data visualization in two primary scenarios. The most common scenario is when you have a data set in mind, or a question that you're trying to answer, and you need a quick way to load the data, represent it, and see what's there. This is important because it lets you take an inventory of the data in question. How many elements are there? What are the largest and smallest values? How many dimensions are we looking at? We'll return to this notion of exploring data in future chapters.
In the second scenario, the desired outcome is known, but the correct means of representing the data and interacting with it have not yet been determined.
The idea of sketching is identical to that of scripting, except that you're not working in an interpreted scripting language, but rather gaining the performance benefit of compiling to Java class files. Of course, strictly speaking, Java itself is an interpreted language, but its bytecode compilation brings it much closer to the "metal" than languages such as JavaScript, ActionScript, Python, or Ruby.
Processing was never intended as the ultimate language for visual programming; instead, we set out to make something that was:
  • A sketchbook for our own work, simplifying the majority of tasks that we undertake
  • A programming environment suitable for teaching programming to a nontraditional audience
  • A stepping stone from scripting languages to more complicated or difficult languages such as full-blown Java or C++
At the intersection of these points is a tradeoff between speed and simplicity of use. If we didn't care about speed, it might make sense to use Python, Ruby, or many other scripting languages. That is especially true for the education side. If we didn't care about making a transition to more advanced languages, we'd probably avoid a C++ or Java-style syntax. But Java is a nice starting point for a sketching language because it's far more forgiving than C/C++ and also allows users to export sketches for distribution via the Web.
Ready?
In this chapter, we covered the basics of the Processing environment, as well as a bit of the philosophy behind the environment itself and the type of software built with the language. In the next chapter, we'll get started representing our first data set.
Chapter 3: Mapping
This chapter covers the basics of reading, displaying, and interacting with a data set. As an example, we'll use a map of the United States, and a set of data values for all 50 states. Drawing such a map is a simple enough task that could be done without programming—either with mapping software or by hand—but it gives us an example upon which to build. The process of designing with data involves a great deal of iteration: small changes that help your project evolve in usefulness and clarity. And as this project evolves through the course of the chapter, it will become clear how software can be used to create representations that automatically update themselves, or how interaction can be used to provide additional layers of information.
Drawing a Map
Some development environments separate work into projects; the equivalent term for Processing is a sketch. Start a new Processing sketch by selecting File → New.
For this example, we'll use a map of the United States as a background image. The map can be downloaded from http://benfry.com/writing/map/map.png.
Drag and drop the map.png file into the Processing editor window. A message at the bottom will appear confirming that the file has been added to the sketch. You can also add files by selecting Sketch → Add File. A sketch is organized as a folder, and all data files are placed in a subfolder named data. (The data folder is covered elsewhere in the book.)
Then, enter the following code:
PImage mapImage;

void setup(  ) {
  size(640, 400);
  mapImage = loadImage("map.png");
}
void draw(  ) {
  background(255);
  image(mapImage, 0, 0);
}
Finally, click the Run button. Assuming everything was entered correctly, a map of the United States will appear in a new window.
Processing API functions are named to make their uses as obvious as possible. Method names, such as loadImage( ), convey the purpose of the calls in simple language. What you may need to get used to is dividing your code into functions such as setup( ) and draw( ), which determine how the code is handled. After clicking the Run button, the setup( ) method executes once. After setup( ) has completed, the draw( ) method runs repeatedly. Use the setup( ) method to load images and fonts and to set initial values for variables. The draw( ) method runs at 60 frames per second (or slower if it takes longer than 1/60th of a second to run the code inside the draw( ) method); it can be used to update the screen to show animation or respond to mouse movement and other types of input.
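The calling order described above can be pictured with a small plain Java sketch. This is only a simplified model under stated assumptions: the SketchLifecycle class and its run( ) driver are hypothetical, the loop runs for a fixed number of frames, and no attempt is made to pace the frames at 60 per second the way the real environment does.

```java
class SketchLifecycle {
    int setupCalls = 0;
    int drawCalls = 0;

    // One-time initialization, such as loading images and fonts.
    void setup() {
        setupCalls++;
    }

    // Per-frame work, such as redrawing the screen.
    void draw() {
        drawCalls++;
    }

    // Hypothetical driver: calls setup() once, then draw() once per frame.
    void run(int frames) {
        setup();
        for (int i = 0; i < frames; i++) {
            draw();
        }
    }

    public static void main(String[] args) {
        SketchLifecycle sketch = new SketchLifecycle();
        sketch.run(60); // about one second's worth of frames at 60 fps
        System.out.println("setup: " + sketch.setupCalls
            + ", draw: " + sketch.drawCalls);
    }
}
```

Anything placed in setup( ) therefore costs its time once, while anything placed in draw( ) costs it on every frame, which is why file loading belongs in the former.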
Our first function calls are very basic. The loadImage( ) function reads an image from the data folder (URLs or absolute paths also work). The PImage class is a container for image data, and the
Additional content appearing in this section has been removed.
Locations on a Map
The next step is to specify some points on the map. To simplify this, a file containing the coordinates for the center of each state can be found at http://benfry.com/writing/map/locations.tsv.
In future chapters, we'll explore how this data is read. In the meantime, some code to read the location data file can be found at http://benfry.com/writing/map/Table.pde.
Add both of these files to your sketch the same way that you added the map.png file earlier.
The Table class is just two pages of code, and we'll get into its function later. In the meantime, suffice it to say that it reads a file as a grid of rows and columns. The class has methods to get an int, float, or String for a specific row and column. To get float values, for instance, use the following format:
table.getFloat(row, column)
Rows and columns are numbered starting at zero, so the column titles (if any) will be row 0, and the row titles will be column 0.
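Until the real class is covered, the grid idea can be sketched in a few lines of plain Java. TinyTable below is a hypothetical stand-in, not the contents of Table.pde: it assumes tab-separated text that is already in memory, does no error handling, and uses made-up rows in its example.

```java
class TinyTable {
    private final String[][] cells;

    // Build the grid by splitting each line on tab characters.
    TinyTable(String[] lines) {
        cells = new String[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            cells[i] = lines[i].split("\t");
        }
    }

    int getRowCount() {
        return cells.length;
    }

    String getString(int row, int column) {
        return cells[row][column];
    }

    int getInt(int row, int column) {
        return Integer.parseInt(cells[row][column]);
    }

    float getFloat(int row, int column) {
        return Float.parseFloat(cells[row][column]);
    }

    public static void main(String[] args) {
        // Made-up rows: a row title in column 0, then x and y values.
        String[] lines = { "AA\t10.5\t20.0", "BB\t30.0\t40.25" };
        TinyTable table = new TinyTable(lines);
        for (int row = 0; row < table.getRowCount(); row++) {
            System.out.println(table.getString(row, 0) + ": "
                + table.getFloat(row, 1) + ", " + table.getFloat(row, 2));
        }
    }
}
```

Because numbering starts at zero, table.getFloat(1, 2) reads the third column of the second row.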
In the previous section, we saw how displaying a map in Processing is a two-step process:
  1. Load the data.
  2. Display the data in the desired format.
Displaying the centers of states follows the same pattern, although a little more code is involved:
  1. Create locationTable and use the locationTable.getFloat( ) function to read each location's coordinates (x and y values).
  2. Draw a circle using those values. Because a circle, geometrically speaking, is just an ellipse whose width and height are the same, graphics libraries provide an ellipse-drawing function that covers circle drawing as well.
A new version of the code follows, with modifications highlighted:
PImage mapImage;
Table locationTable;
int rowCount;

void setup(  ) {
  size(640, 400);
  mapImage = loadImage("map.png");
  // Make a data table from a file that contains
  // the coordinates of each state.
  locationTable = new Table("locations.tsv");
  // The row count will be used a lot, so store it globally.
  rowCount = locationTable.getRowCount(  );
}

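void draw(  ) {
  background(255);
  image(mapImage, 0, 0);

  // The listing is cut off here in this excerpt. The loop below is a
  // reconstruction of the two steps described above; the column numbers
  // (1 for x, 2 for y), fill color, and 9-pixel diameter are assumptions.
  fill(192, 0, 0);
  noStroke(  );
  for (int row = 0; row < rowCount; row++) {
    float x = locationTable.getFloat(row, 1);
    float y = locationTable.getFloat(row, 2);
    ellipse(x, y, 9, 9);
  }
}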
Data on a Map
Next we want to load a set of values that will appear on the map itself. For this, we add another Table object and load the data from a file called random.tsv, available at http://benfry.com/writing/map/random.tsv.
It's always important to find the minimum and maximum values for the data, because that range will need to be mapped to other features (such as size or color) for display. To do this, use a for loop to walk through each row of the data table and check whether each value is bigger than the maximum found so far, or smaller than the minimum. To begin, the dataMin variable is set to MAX_FLOAT, a built-in constant for the largest possible float value. This ensures that dataMin will be replaced with the first value found in the table. The same is done for dataMax, by setting it to MIN_FLOAT, which in Processing is the most negative float value. Starting both variables at 0 instead will not work in cases where the minimum value in the data set is a positive number (e.g., 2.4) or the maximum is a negative number (e.g., −3.75).
The data table is loaded in the same fashion as the location data, and the code to find the minimum and maximum immediately follows:
PImage mapImage;
Table locationTable;
int rowCount;
Table dataTable;
float dataMin = MAX_FLOAT;
float dataMax = MIN_FLOAT;

void setup(  ) {
  size(640, 400);
  mapImage = loadImage("map.png");
  locationTable = new Table("locations.tsv");
  rowCount = locationTable.getRowCount(  );

  // Read the data table.
  dataTable = new Table("random.tsv");

  // Find the minimum and maximum values.
  for (int row = 0; row < rowCount; row++) {
    float value = dataTable.getFloat(row, 1);
    if (value > dataMax) {
      dataMax = value;
    }
    if (value < dataMin) {
      dataMin = value;
    }
  }
}
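The reason for starting from MAX_FLOAT and MIN_FLOAT rather than 0 can be checked with a small stand-alone sketch. It is written in plain Java, where Float.MIN_VALUE is the smallest positive float rather than the most negative one, so -Float.MAX_VALUE plays the role of Processing's MIN_FLOAT. The class name MinMaxScan and the sample values are made up for illustration.

```java
// Stand-alone illustration of the min/max scan; the values are made up.
class MinMaxScan {
  // Returns {min, max}, starting from sentinels that any real value replaces.
  static float[] range(float[] values) {
    float min = Float.MAX_VALUE;   // plays the role of Processing's MAX_FLOAT
    float max = -Float.MAX_VALUE;  // plays the role of Processing's MIN_FLOAT
    for (float v : values) {
      if (v > max) max = v;
      if (v < min) min = v;
    }
    return new float[] { min, max };
  }

  // The flawed variant: starting both at 0 hides all-positive or all-negative ranges.
  static float[] rangeFromZero(float[] values) {
    float min = 0;
    float max = 0;
    for (float v : values) {
      if (v > max) max = v;
      if (v < min) min = v;
    }
    return new float[] { min, max };
  }

  public static void main(String[] args) {
    float[] positives = { 2.4f, 7.1f, 3.3f };
    System.out.println(range(positives)[0]);          // the true minimum
    System.out.println(rangeFromZero(positives)[0]);  // stuck at the starting value
  }
}
```

With only positive values, the zero-based variant never replaces its starting minimum, which is exactly the failure described above.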
The other half of the program (shown later) draws a data point for each location. A drawData( ) function is introduced, which takes x and y coordinates as parameters, along with an abbreviation for a state. The drawData( ) function grabs the float value from column 1 based on a state abbreviation (which can be found in column 0).
Using Your Own Data
The file format presented in this chapter is straightforward, so try replacing the random.tsv file with your own data based on the 50 states. It's remarkably easy to plot your own values to individual locations. You'll probably still use the map( ) function, but you don't have to use ellipses or colors to plot your data points. You could draw an image at each location, varying its size based on the data. Or some points could be hidden or reorganize themselves in various ways. The points might refer to anything from chain coffee shops per capita to poverty levels in each state.
Not everyone wants to employ data relating to the United States, but the same technique is sound for any type of data mapped to particular points. In later chapters, we'll get into mapping latitude and longitude coordinates, as well as using shape data for locations, but even the simple example presented in this chapter can be used in many other ways.
The following code reads from the names.tsv file and asks the user to click the map where each name's data point should be placed, one name at a time. Start this example as a separate sketch. It requires a map.png file, a names.tsv file, and the Table.pde file used throughout this chapter. The map image and names file can be replaced with data of your choice, and this code produces a locations.tsv file that can be added to the data folder of the sketch that plots the data:
PImage mapImage;
Table nameTable;

int currentRow = -1;
PrintWriter writer;

void setup(  ) {
  size(640, 400);
  mapImage = loadImage("map.png");
  nameTable = new Table("names.tsv");
  writer = createWriter("locations.tsv");
  cursor(CROSS);  // make easier to pinpoint a location
  println("Click the mouse to begin.");
}

void draw(  ) {
  image(mapImage, 0, 0);
}

void mousePressed(  ) {
  if (currentRow != -1) {
    String abbrev = nameTable.getRowName(currentRow);
    writer.println(abbrev + "\t" + mouseX + "\t" + mouseY);
  }

  currentRow++;
  if (currentRow == nameTable.getRowCount(  )) {
    // Close the file and finish.
    writer.flush(  );
    writer.close(  );
    exit(  );
  } else {
    // Ask for the next coordinate.
    String name = nameTable.getString(currentRow, 1);
    println("Choose location for " + name + ".");
  }
}
Next Steps
In this chapter, we learned the basics of reading, displaying, and interacting with a data set. The chapters that follow delve into far more sophisticated aspects of each, but all of them build on the basic skills you've picked up here.
Chapter 4: Time Series
The time series is a ubiquitous type of data set. It describes how some measurable feature (for instance, population, snowfall, or items sold) has changed over a period of time. Edward Tufte credits Johann Heinrich Lambert with the formal introduction of the time series to scientific literature in the 1700s.
Because of its ubiquity, the time series is a good place to start when learning about visualization. With it we can cover:
  • Acquiring a table of data from a text file
  • Parsing the contents of the file into a usable data structure
  • Calculating the boundaries of the data to facilitate representation
  • Finding a suitable representation and considering alternatives
  • Refining the representation with consideration for placement, type, line weight, and color
  • Providing a means of interacting with the data so that we can compare variables against one another or against the average of the whole data set
For a straightforward data set, let's turn to the U.S. Department of Agriculture (USDA) for statistics on beverage consumption. Government sites are a terrific resource for data.
Most methods will be implemented "by hand" in this section. Further down the line, we'll make generalized code to handle different scenarios, such as reading a table from a file or placing labels and grid lines on a plot.
Milk, Tea, and Coffee (Acquire and Parse)
The data set we use was originally downloaded from http://www.ers.usda.gov/data/foodconsumption/foodavailqueriable.aspx.
The page lets you define a query to download a data set of interest. The site claims that the data is in Excel format, but a glance at the contents of the resulting file shows that it's only an HTML file with an .xls extension that fools Excel into opening it. Rather than getting into the specifics of how to download and clean the data, I offer an already processed version here:
This data set contains three columns: the first for milk, the second for coffee, and the third for tea consumption in the United States from 1910 to 2004.
To read this file, use this modified version of the Table class from the previous chapter:
The modified version stores the data as float values, making it more efficient than the previous version, which converted the data each time getString( ), getFloat( ), or getInt( ) was called.
Open Processing and start a new sketch. Add both files to the sketch by either dragging each into the editor window or using Sketch → Add File.
Cleaning the Table (Filter and Mine)
It's necessary to determine the minimum and maximum of each of the columns in the pre-filtered data set. These values are used to properly scale plotted points to locations on the screen.
The FloatTable class has methods for calculating the min and max for the rows and columns. These methods are worth discussing because they are important in later code. The following example calculates the maximum value for a column (comments denote important portions of the code):
  float getColumnMax(int col) {
    // Set the value of m arbitrarily low, so the first value
    // found will be set as the maximum.
    float m = MIN_FLOAT;

    // Loop through each row.
    for (int row = 0; row < rowCount; row++) {

      // Only consider valid data elements (see later text).
      if (isValid(row, col)) {
        // Finally, check to see if the value
        // is greater than the maximum found so far.
        if (data[row][col] > m) {
          m = data[row][col];
        }
      }
    }
    return m;
  }
The isValid( ) method is important because most data sets have incomplete data. In the milk-tea-coffee.tsv file, all of the data is valid, but in most data sets (including others used in this chapter), missing values require extra consideration.
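As a rough illustration of how missing values affect such a calculation, the following plain-Java sketch skips invalid cells while searching for a column maximum. How FloatTable actually records a missing value is not shown in this excerpt; storing it as NaN is an assumption made only for this sketch, and the class name and sample numbers are made up.

```java
// Illustration only; the NaN convention is an assumption, not FloatTable's documented behavior.
class ColumnMaxSketch {
  // Assumption for this sketch: a missing value is stored as NaN.
  static boolean isValid(float[][] data, int row, int col) {
    return !Float.isNaN(data[row][col]);
  }

  // Largest valid value in one column of a rectangular table.
  static float columnMax(float[][] data, int col) {
    float m = -Float.MAX_VALUE;  // lower than any real value
    for (int row = 0; row < data.length; row++) {
      if (isValid(data, row, col) && data[row][col] > m) {
        m = data[row][col];
      }
    }
    return m;
  }

  public static void main(String[] args) {
    // Made-up table with two gaps.
    float[][] data = {
      { 32.2f, 9.6f, Float.NaN },
      { 30.1f, Float.NaN, 6.2f },
      { 21.0f, 33.0f, 5.1f }
    };
    System.out.println(columnMax(data, 2));
  }
}
```

The gaps are simply passed over, so a column with missing rows still yields the maximum of the values that are present.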
Because the values for milk, coffee, and tea will be compared against one another, it's necessary to calculate the maximum value across all of the columns. The following bit of code does this after loading the milk-tea-coffee.tsv file:
FloatTable data;
float dataMin, dataMax;

void setup(  ) {
  data = new FloatTable("milk-tea-coffee.tsv");

  dataMin = 0;
  dataMax = data.getTableMax(  );
}
Sometimes, it's also useful to calculate the minimum value, but setting the minimum to zero provides a more accurate comparison between the three data sets. The minimum for this data set is 5.1, and the values for the tea column hover around 6, so using 5.1 as the dataMin value would produce a chart that looked as though the beverage history included periods of no (or nearly no) tea consumption in the U.S. In addition, a value of 6 should not read as merely 0.9 above the bottom of the chart; the viewer should see the full range from 0 up to 5.1 and how that compares to a value of 6.
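The effect of the baseline can be checked with a little arithmetic. The plain-Java sketch below computes how far up the plot a value would be drawn, as a fraction of the plot height. The maximum of 45 is an assumed figure chosen only for illustration, not a value taken from the data file, and the class name is made up.

```java
// Illustrative only: the maximum of 45 is assumed, not read from the data file.
class BaselineSketch {
  // Fraction of the plot height at which a value is drawn (0 = bottom, 1 = top).
  static float fraction(float value, float dataMin, float dataMax) {
    return (value - dataMin) / (dataMax - dataMin);
  }

  public static void main(String[] args) {
    float assumedMax = 45f;
    // With the minimum at 5.1, the lowest value sits on the bottom edge of the plot.
    System.out.println(fraction(5.1f, 5.1f, assumedMax));
    // With the minimum at 0, the same value stays visibly above the bottom edge.
    System.out.println(fraction(5.1f, 0f, assumedMax));
    System.out.println(fraction(6f, 0f, assumedMax));
  }
}
```

With a zero baseline, 5.1 and 6 both sit a little above the bottom edge and close to each other, which matches their actual relationship.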
A Simple Plot (Represent and Refine)
To begin the representation, it's first necessary to set the boundaries for the plot location. The plotX1, plotY1, plotX2, and plotY2 variables define the corners of the plot. To provide a nice margin on the left, set plotX1 to 50, and then set the plotX2 coordinate by subtracting this value from width. This keeps the two sides even, and requires only a single change to adjust the position of both. The same technique is used for the vertical location of the plot:
FloatTable data;
float dataMin, dataMax;

float plotX1, plotY1;
float plotX2, plotY2;

int yearMin, yearMax;
int[] years;


void setup(  ) {
  size(720, 405);

  data = new FloatTable("milk-tea-coffee.tsv");

  years = int(data.getRowNames(  ));
  yearMin = years[0];
  yearMax = years[years.length - 1];

  dataMin = 0;
  dataMax = data.getTableMax(  );

  // Corners of the plotted time series
  plotX1 = 50;
  plotX2 = width - plotX1;
  plotY1 = 60;
  plotY2 = height - plotY1;

  smooth(  );
}
Next, add a draw( ) method that sets the background to a light gray and draws a filled white rectangle for the plotting area. The white box makes the plot stand out against the background, whereas placing a color behind the plot itself can muddy its appearance.
The rect( ) function normally takes the form rect(x, y, width, height), but rectMode(CORNERS) changes the parameters to rect(left, top, right, bottom), which is useful because our plot's shape is defined by the corners. Like other methods that affect drawing properties, such as fill( ) and stroke( ), rectMode( ) affects all geometry that is drawn after it until the next time rectMode( ) is called:
void draw(  ) {
  background(224);

  // Show the plot area as a white box.
  fill(255);
  rectMode(CORNERS);
  noStroke(  );
  rect(plotX1, plotY1, plotX2, plotY2);

  strokeWeight(5);
  // Draw the data for the first column.
  stroke(#5679C1);
  drawDataPoints(0);
}


// Draw the data as a series of points.
void drawDataPoints(int col) {
  int rowCount = data.getRowCount(  );
  for (int row = 0; row < rowCount; row++) {
    if (data.isValid(row, col)) {
      float value = data.getFloat(row, col);
      float x = map(years[row], yearMin, yearMax, plotX1, plotX2);
      float y = map(value, dataMin, dataMax, plotY2, plotY1);
      point(x, y);
    }
  }
}
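The two map( ) calls in drawDataPoints( ) do the real work of placing each point, so it can help to see the arithmetic on its own. The plain-Java sketch below reimplements the linear remapping that map( ) performs and applies it to the plot corners set in setup( ). The class name is made up, and the data maximum of 45 is an assumed value used only for illustration.

```java
// Stand-alone version of the remapping; dataMax is an assumed value.
class MapSketch {
  // Linear remapping of value from the range [start1, stop1] to [start2, stop2].
  static float map(float value, float start1, float stop1, float start2, float stop2) {
    return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1));
  }

  public static void main(String[] args) {
    // Plot corners for a 720 x 405 window, as set in setup().
    float plotX1 = 50, plotX2 = 720 - plotX1;
    float plotY1 = 60, plotY2 = 405 - plotY1;
    float dataMax = 45f;  // assumed maximum, for illustration only

    System.out.println(map(1910, 1910, 2004, plotX1, plotX2));  // first year: left edge
    System.out.println(map(2004, 1910, 2004, plotX1, plotX2));  // last year: right edge
    // Passing plotY2 before plotY1 flips the axis, so zero lands on the bottom edge.
    System.out.println(map(0, 0, dataMax, plotY2, plotY1));
    System.out.println(map(dataMax, 0, dataMax, plotY2, plotY1));
  }
}
```

The y mapping lists plotY2 first because screen coordinates grow downward; reversing the output range is what makes larger values appear higher on the plot.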
Labeling the Current Data Set (Refine and Interact)
Missing from the previous code is an indicator of the currently visible column of data (whether milk, tea, or coffee) and a means to swap between the three. For this, we add a variable to keep track of the current column, and another for the font used for the title. A few lines of code are also added to the draw( ) method to write the name of the column with the text( ) method:
FloatTable data;
float dataMin, dataMax;

float plotX1, plotY1;
float plotX2, plotY2;

int currentColumn = 0;
int columnCount;

int yearMin, yearMax;
int[] years;

PFont plotFont;


void setup(  ) {
  size(720, 405);
  data = new FloatTable("milk-tea-coffee.tsv");
  columnCount = data.getColumnCount(  );

  years = int(data.getRowNames(  ));
  yearMin = years[0];
  yearMax = years[years.length - 1];

  dataMin = 0;
  dataMax = data.getTableMax(  );

  // Corners of the plotted time series
  plotX1 = 50;
  plotX2 = width - plotX1;
  plotY1 = 60;
  plotY2 = height - plotY1;

  plotFont = createFont("SansSerif", 20);
  textFont(plotFont);

  smooth(  );
}


void draw(  ) {
  background(224);

  // Show the plot area as a white box.
  fill(255);
  rectMode(CORNERS);
  noStroke(  );
  rect(plotX1, plotY1, plotX2, plotY2);

  // Draw the title of the current plot.
  fill(0);
  textSize(20);
  String title = data.getColumnName(currentColumn);
  text(title, plotX1, plotY1 - 10);

  stroke(#5679C1);
  strokeWeight(5);
  drawDataPoints(currentColumn);
}
The text( ) line draws the text 10 pixels above plotY1, which represents the top of the plot, and the drawDataPoints( ) line uses currentColumn instead of just 0. The figure shows the result.
Figure: Time series with data set labeled
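The listing above tracks currentColumn and columnCount, but this excerpt does not show how the current column is changed. One possible approach, sketched here in plain Java, is to step the index forward or backward and wrap around at either end; in a sketch, these helpers would presumably be called from a key or mouse handler. The class and method names are hypothetical.

```java
// Hypothetical helpers for stepping through columns; not taken from the book's code.
class ColumnCycler {
  // Move to the next column, wrapping from the last back to the first.
  static int next(int current, int columnCount) {
    return (current + 1) % columnCount;
  }

  // Move to the previous column, wrapping from the first to the last.
  static int previous(int current, int columnCount) {
    return (current - 1 + columnCount) % columnCount;
  }

  public static void main(String[] args) {
    int columnCount = 3;  // milk, tea, and coffee
    System.out.println(next(2, columnCount));
    System.out.println(previous(0, columnCount));
  }
}
```

Adding columnCount before taking the remainder in previous( ) keeps the result non-negative, since the % operator in Java can return a negative value for a negative operand.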
The createFont( ) function is used to create a font from one of the built-in typefaces. The built-in typefaces are Serif, SansSerif, Monospaced, Dialog, and DialogInput; they map to the default fonts on each operating system. On Mac OS X, for instance, SansSerif maps to Lucida Sans, whereas on Windows it maps to Arial. The default fonts are useful when you don't want to deal with the Create Font tool, but the font choices are not particularly inspiring, and they don't guarantee consistent output across different operating systems. For instance, making pixel-level decisions with a built-in font is a bad idea because the shaping and spacing of the characters can be significantly different on other operating systems.
Drawing Axis Labels (Refine)
An unlabeled plot has minimal utility. It clearly displays relative up or down swings, but without a sense of the time period or amounts to indicate the degree of swing, it's impossible to know whether values have changed by, say, 5% or 50%. And some indication is required to explain that the horizontal axis represents the year and the vertical axis represents actual volumes: the amount consumed of a particular beverage, measured in gallons per capita per year.
There are clever (and complicated) means of selecting intervals, but for this project, we will pick the interval by hand. Choosing a proper interval and deciding whether to include major and minor tick marks depends on the data, but a general rule of thumb is that five intervals is at the low end, and more than ten is likely a problem. Too many labels make the diagram look like graph paper, and too few suggest that only the minimum and maximum values need to be shown.
The most important consideration is the way the data is used. Are minute, year-by-year comparisons needed? Always use the fewest intervals you can get away with, as long as the plot shows the level of detail the reader needs. Sometimes no labels are necessary—if values are only meant to be compared against one another. For instance, you might dispense with labels if you want to show only upward and downward trends. Other factors, such as the width of the plot, also play a role, so determining the correct level of detail usually requires a little trial and error.
Creating the year axis is straightforward. The data ranges from 1910 to 2004, so an interval of 10 years means marking 10 individual years: 1910, 1920, 1930, and so on, up to 2000. Add the yearInterval variable to the beginning of the code before setup( ):
int yearInterval = 10;
Next, add the following function to draw the year labels:
void drawYearLabels(  ) {
  fill(0);
  textSize(10);
  textAlign(CENTER, TOP);
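  // rowCount is not declared globally in the listings shown so far,
  // so fetch it here, as drawDataPoints( ) does.
  int rowCount = data.getRowCount(  );
  for (int row = 0; row < rowCount; row++) {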
    if (years[row] % yearInterval == 0) {
      float x = map(years[row], yearMin, yearMax, plotX1, plotX2);
      text(years[row], x, plotY2 + 10);
    }
  }
}
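The modulo test in drawYearLabels( ) decides which years receive a label. The plain-Java sketch below isolates that test so the selection can be checked without drawing anything; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Isolates the label-selection test; names are made up for illustration.
class YearLabelSketch {
  // Collect every year in [yearMin, yearMax] that is a multiple of yearInterval.
  static List<Integer> labeledYears(int yearMin, int yearMax, int yearInterval) {
    List<Integer> labeled = new ArrayList<>();
    for (int year = yearMin; year <= yearMax; year++) {
      if (year % yearInterval == 0) {
        labeled.add(year);
      }
    }
    return labeled;
  }

  public static void main(String[] args) {
    // The chapter's range and interval: 1910 to 2004, every 10 years.
    System.out.println(labeledYears(1910, 2004, 10));
  }
}
```

Because the test is on the year itself rather than on the row index, labels fall on round years even when the first year in the data is not a multiple of the interval.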
Choosing a Proper Representation (Represent and Refine)