“We are the safest large city in America, but any crime rate is too high.”
But not all boroughs are created equal. As we saw in Chapter 1, they have different surface areas and different numbers of people living in them. The fact that they produce different amounts of garbage isn’t surprising, and probably doesn’t make for a very interesting message to communicate to an audience.
These absolute comparisons, while helpful for showing the big picture, aren’t generally considered “apples-to-apples” comparisons. There are apples, and, well, there are Big Apples.
A ratio is a comparison of two terms expressed as a quotient. For example, Manhattan produced 0.264 tons of recycle for every ton of refuse. Ratios can be expressed as “x to y,” “x:y,” “x/y,” or as a decimal.
A rate is a ratio in which the two terms have different units. For example, the population density of Brooklyn is 36,136 residents per square mile. Rates are often predictive because time can be used as the denominator (crime rates, population growth rates, etc.).
A proportion is a ratio in which the numerator is a partial amount and the denominator is the total amount (expressed as a number between 0 and 1). For example, the proportion of the NYC population living in the Bronx is 0.169. A proportion is expressed as a number between 0 and 1.
A percentage is a ratio comparing a number to 100. For example, 16.9% of NYC residents live in the Bronx. A percentage is generally a number between 0 and 100, but can be larger than 100 (e.g., “sales have increased by 150% year-over-year”).
These types of normalized comparisons can make for much more interesting messages to communicate. Luckily, they are quite easy to create in Tableau. We’ll explore the first two in this chapter, and the last two in the next.
The data source itself may include a field that is a ratio, but most often the raw data includes terms that an analyst needs to combine to create a ratio. Tableau can handle these types of computations with Calculated Fields, in which new fields can be created by combining existing fields using a variety of operations. We introduced the concept in Chapter 2, and now we’ll build on it.
Let’s return to the DSNY data set to illustrate the technique, and we’ll see if the ratio comparison is any more interesting than the absolute figures.
The data reported by DSNY includes three continuous Measures (quantitative fields) for each community district:
Amount of nonrecyclable garbage
Amount of recyclable paper
Amount of recyclable metal, glass, and plastic
The second two data types (paper and MGP) are both recyclable, and
so can be summed to create a new variable for each community district that
we will call
stay consistent with the provided nomenclature.
First, we’ll need to create a new calculated field that sums the two recyclable fields. Right-click anywhere in the lefthand Dimensions or Measures panels, and select Create Calculated Field. Then, fill out the resulting dialog box, as shown in Figure 4-1.
Next, let’s create a second calculated field that creates the ratio of recyclable material to refuse. Right-click in the Dimensions or Measures area again and fill out the dialog box as shown in Figure 4-2.
Notice that we put the numerator and denominator in brackets,
SUM. This is critical, because as we move up a level of from community
district to borough, we don’t want to add the
individual ratios of each community district; we want to create a new
ratio that represents the quotient of the aggregate amounts. The first
approach would create a larger ratio that is misleading.
Now that we’ve created the ratios, how can we best visualize them?
Let’s start with a simple bar chart of the aggregate ratios of each of the
boroughs. To create one, we’ll drag the
Borough field from the Dimensions area on the
left to the Rows shelf, and we’ll drag the
Recycle to Refuse Ratio calculated field to the
Columns shelf. Color wouldn’t strictly be needed, but we’ll add it anyway
for aesthetics by dragging
Borough to the Color shelf. Lastly, we’ll sort the bars in descending order to put the borough
with the highest recycle ratio at the top. The screen shot in Figure 4-3 shows the resulting view.
We can easily see from this simple bar chart that Manhattan has the highest ratio of recycle to refuse, but that it still only recycles about one ton of material for every four tons of trash. The ratio for Manhattan is about twice as great as the ratio for the Bronx.
Does this mean each community district in Manhattan had a higher recycle ratio than each community district in the Bronx? Not necessarily, as we’re dealing with the terms in the aggregate.
How would we communicate the results at a community district level?
If we just replace
CommunityDistrict in the Rows shelf,
then we’ll get a bar chart where all the community district #1s will be
aggregated, and all the community district #2s will be aggregated, and so on. Aggregating community districts
in this way isn’t very meaningful.
Instead, we’d like to see community district ratios for each
To make this comparison, let’s click and drag the
Borough pill from the Rows shelf up to the
Columns shelf, and add the
CommunityDistrict field from the Measures panel
to the Rows shelf to create a grid of bar charts, as shown in Figure 4-4.
From this view, we can now see that even within boroughs, not all community districts are created equal. Manhattan community district #1 has the highest ratio overall, but it’s clear that some community districts in the Bronx have higher ratios than some in Manhattan.
But what if we wanted to know the exact ratios? The bar charts in Figure 4-4 make it fairly easy to get a general sense of the relative ratios for the community districts, but precision isn’t easily obtained, is it? Can you tell the exact ratio for Manhattan community district #1? I can’t. If knowing the precise amounts is a task our audience will want to perform, we’ll need to think about how to make it easy for them to accomplish that task.
We could increase precision by dragging the
Recycle to Refuse Ratio from the Measures area
to the Label shelf (or Ctrl-selecting
AGG(Recycle to Refuse Ratio) and dragging it
from the Columns shelf to the Label shelf to duplicate). The
resulting bar chart grid with labels is shown in Figure 4-5.
That’s a lot of decimal places, isn’t it? Our audience probably
doesn’t need to know the ratio to the fourth decimal place
(ten-thousandths). We can change the number of digits shown by
to Refuse Ratio in the Measures panel, selecting Default Properties, then Number Format, and choosing Number (Custom). In the resulting dialog box,
we’ll change the decimal places to three, as shown in Figure 4-6.
Notice that the labels in the bar chart grid change accordingly.
What we really have here is a table that has been augmented by adding bars with lengths proportional to the values in each cell. Another way to show this precise comparison is by getting rid of the bars and instead coloring each cell by the ratio—a highlight table.
Let’s make a highlight table by starting with a new sheet.
Recycle to Refuse Ratio, and then open the
Show Me panel and select highlight table. Tableau creates the view shown
in Figure 4-7.
Notice that Tableau created a highlight table that’s “flipped”
compared to our bar chart grid shown in Figure 4-5 (
Borough is in Rows instead of Columns, and
CommunityDistrict is in Columns instead
of Rows). To flip it to match, we’ll just click the Swap button in the toolbar, , resulting in the taller table shown in Figure 4-8.
Tableau has defaulted to a sequential green color palette, meaning
the gradient increases proportionately with increasing
Recycle to Refuse Ratio. Our eyes can easily
pick out the darkest cell, and we also find the lightest rather easily.
Most people can distinguish between about six different levels of
intensity. If the goal were to communicate the precise values, we could
have created a simple table without the highlights. Adding the highlights
makes it easier to quickly compare cells and find the highest and lowest
It’s still not immediately obvious which community district has the second-highest ratio, though. Our eyes have to dart around to all the dark cells and read and memorize the values. Eventually we settle on Brooklyn community district #6 as having the second-highest ratio, but the task gets harder as we move along. Which has the third highest? The fourth? Is there an easier way to facilitate this assessment of rank?
What we’d like is a simple list of community districts in descending
order of the ratio. To make this list, let’s first create a new field that
CommunityDistrict fields into a single
CommunityDistrict so they are both selected,
then right-click on one of the two selected fields (either will do fine)
and select Combine Fields. A new
string field will appear in the Dimensions area called
Borough & CommunityDistrict
Create the list shown in Figure 4-9 by dragging
Borough & CommunityDistrict to the Rows
Recycle to Refuse Ratio to the
Columns shelf, and
Borough to the Color shelf, sorting in
Now we can quickly tell which are the top four or five districts, and which are the bottom four or five. Using the colored bars, we also can see some patterns, like that the districts with the lowest ratio are all in the Bronx (blue), and most of the top ten are in Manhattan (green).
But if I asked you to pick out the tenth or the fifteenth districts, you’d have to count down from the top, right? Not the easiest task to perform. How can we add rank to this list to easily pick out the order from 1 to 59?
Let’s create a
Rank field. To
do so, right-click anywhere in the Dimensions or Measures panel
on the left and select Create a Calculated
Field once again. This time, name the field
Rank and enter
INDEX(). As soon
as the message appears in the bottom left indicating “the calculation is
valid,” blue text will also appear in the top right that says “Default
Table Calculation.” Click on this blue text and change the Compute using drop-down to Borough & CommunityDistrict (Combined),
as shown in Figure 4-10.
This tells Tableau that the rank should apply to the combination
After clicking OK twice, we
now see a new field in the Measures panel called
Rank. Notice that this field is continuous
# symbol is green), but we can consider the rankings to be discrete integers,
because there is no rank between any two successive ranks. To change the
Rank data field type, right-click on
it in the Measures panel and select Convert to
Rank will turn blue, indicating that it’s discrete. We’re now ready
Rank to the chart.
To add the
Rank field to the
bar chart, simply drag it to the left of
Borough & CommunityDistrict (Combined) on
the Rows shelf. After removing the row dividers, we have the colored, sorted, and now ranked bar chart
shown in Figure 4-11.
Alternatively, since the launch of Version 8.1, users can create a
Rank field by making use of
the new Rank Table Calculation. To do so, drag another instance of
Recycle to Refuse Ratio to the
columns shelf, click in the down arrow of the new pill, and change it to
Discrete, and then click the down
arrow again and select Quick Table
Calculation and then Rank. Finally, click the down arrow once more
and select Compute using, then
choose Borough CommunityDistrict
(combined). (This calcuated field also lets you handle ties
in different ways.)
Now the only task this chart doesn’t facilitate is gleaning
precise values. We could add the labels to the right of the bars as we
did in Figure 4-5 by simply
Ratio field to the Label
shelf, but perhaps a better approach would be to add the labels to the
left of the bars so as not to interfere with the perception of the bars’
Recycle to Refuse Ratio
is a continuous data field (the
symbol is green), if we just drag it out onto the Rows shelf to the
CommunityDistrict, Tableau will create 59 y-axes for each of
the bars. Try it and see what I mean.
We just want the ratio value itself to appear, so we’ll first
Recycle to Refuse Ratio to a
discrete field the same way we converted
Rank, and then we’ll drag the ratio onto the
Rows shelf to the far right, leaving a blue pill and creating the
updated view shown in Figure 4-12.
We’ve now visualized the ratio in a way that makes it easy for our audience to perform a number of different tasks very quickly:
Get a general sense of how the 59 community districts compare relative to one another
Determine which districts have the highest ratios, and which have the lowest
Get a general sense of which boroughs have districts at the top of the list, and which have districts at the bottom
Know the precise ratio for any district
Pick out any district in rank order (e.g., the 12th or the 28th)
This view is a very versatile and useful way to communicate the ratio of recycle to refuse. But it doesn’t tell us which community districts produced more trash overall per person. For all we know, the districts at the top may be recycling a lot, but they may also be producing a much higher amount per person. That’s where rates come into play.
Recall that a rate is just a special type of ratio in which the numerator and the denominator have different units. In the previous example, the ratio of recycle to refuse involved terms with the same units—namely, tons. When we consider the trash production rate per person, our numerator is still tons, but now our denominator is the number of people.
But the data set we have been working with so far doesn’t include community district population. In order to determine trash production per person, we’ll need to find a way to include population data in our analysis.
Going back to the data discovery process in Figure 1-1, we’ve gone all the way around the horse track, and now we have a brand-new question that involves gathering new data.
Ideally, we would find population by community district for September 2011, which is when the DSNY collection data was taken. We can’t find that, but we came come close: 2010 census data puts us within a year’s time of the trash collection data, and is probably suitable to get a reasonable approximation of the trash production rate.
Now that we found population data, how can we create rates with it? Luckily, Tableau allows us to connect our current workbook to this new data set and “blend” it with the existing data in the workbook.
From the Data menu of our current workbook, select Connect to Data and then Microsoft Excel, and browse to the population data file, connecting to the correct sheet in the file and importing the data as an extract. We now see the new data source appear in the Data area along with the DSNY data, as shown in Figure 4-13.
The newly imported data set is highlighted in the Data area, and the fields for that data set are shown in the Dimensions and Measures panels below. If we click on the DSNY data set, the data fields for that data set will show below.
is in the Dimensions panel of the DSNY set, but it’s in the Measures
area of the new population data source. Let’s match them up by dragging
CommunityDistrict from the Measures
panel of the population table into the Dimensions area.
Because both data sources now have a Dimension called
Borough and a Dimension called
CommunityDistrict, Tableau will link them
together for us. If the fields to link had different names, we would
have to manually link them using Data → Edit
Now that we have linked these two data sources, our updated Tableau workbook contains both terms of the refuse per person ratio: the measurement of the amount of refuse in tons (the numerator) and the count of the number of people in each community district (the denominator). We’ll create the rate the same way we created the ratio in the previous section, but now we need to create a calculated field that includes fields from two different sources, as shown in Figure 4-14.
type for both terms in the rate means that we can get
meaningful rates for boroughs as well as community districts. Also,
notice we are converting from short tons (US) to pounds (lbs.) by
multiplying the numerator by 2,000—which is how many pounds are in each
ton. On a per person basis, it’s much easier for us to think in terms of
pounds, so this is a better unit to use for this comparison.
Now that we have created the rate, we can visualize it. We’ll follow the same steps that we used to create the ratio bar chart in Figure 4-12 to create the rate comparison bar chart shown in Figure 4-15.
It’s clear from this view that the three community districts in Staten Island produced the most refuse per person in September 2011, at a rate of more than 70 pounds per person.
A word of caution about these results: often, data visualization and data analysis are best used to propose new questions to ask. Data is great at helping us make comparisons, but it doesn’t always answer “why?” and “how?” It can be easy for us to jump to conclusions based on what the data shows, but some further qualitative investigation is frequently required.
Why are the Staten Island community districts at the top of this chart? The chart itself doesn’t tell us. It helps us formulate the next question. Does this mean every person who lives on Staten Island produced 70 to 80 pounds of trash that month? No, that’s not what the data shows. The data just shows how much trash was collected from their communities. DSNY provided overall refuse collected, not just for residences, so perhaps there are industries or businesses on Staten Island that produced a lot of trash.
In this chapter, we learned how to use Calculated Fields and blended data sources to create ratios (quotients) and rates (quotients with mixed units). We also learned the Index function as well as the Rank table calculation, and created highlight tables and bar charts to compare different dimensions—at both the borough and community district levels. This enabled us to go beyond overall amounts (“how much”) and to consider relative amounts (“how much per”).