“We are the safest large city in America, but any crime rate is too high.”

—Michael Bloomberg, former mayor of New York City

Until now, we’ve considered absolute amounts of countable and measurable nouns, like population, tons of garbage collected, and number of rat sightings by borough. Pleasant, I know.

But not all boroughs are created equal. As we saw in Chapter 1, they have different surface areas and different numbers of people living in them. The fact that they produce different amounts of garbage isn’t surprising, and probably doesn’t make for a very interesting message to communicate to an audience.

These absolute comparisons, while helpful for showing the big picture, aren’t generally considered “apples-to-apples” comparisons. There are apples, and, well, there are Big Apples.

A helpful way to “normalize” comparisons is using ratios, rates, proportions, and percentages. What’s the difference between these four?

A

*ratio*is a comparison of two terms expressed as a quotient. For example, Manhattan produced 0.264 tons of recycle for every ton of refuse. Ratios can be expressed as “x to y,” “x:y,” “x/y,” or as a decimal.A

*rate*is a ratio in which the two terms have different units. For example, the population density of Brooklyn is 36,136 residents per square mile. Rates are often predictive because time can be used as the denominator (crime rates, population growth rates, etc.).A

*proportion*is a ratio in which the numerator is a partial amount and the denominator is the total amount (expressed as a number between 0 and 1). For example, the proportion of the NYC population living in the Bronx is 0.169. A proportion is expressed as a number between 0 and 1.A

*percentage*is a ratio comparing a number to 100. For example, 16.9% of NYC residents live in the Bronx. A percentage is generally a number between 0 and 100, but can be larger than 100 (e.g., “sales have increased by 150% year-over-year”).

These types of normalized comparisons can make for much more interesting messages to communicate. Luckily, they are quite easy to create in Tableau. We’ll explore the first two in this chapter, and the last two in the next.

The data source itself may include a field that is a ratio, but most often the raw data includes terms that an analyst needs to combine to create a ratio. Tableau can handle these types of computations with Calculated Fields, in which new fields can be created by combining existing fields using a variety of operations. We introduced the concept in Chapter 2, and now we’ll build on it.

Let’s return to the DSNY data set to illustrate the technique, and we’ll see if the ratio comparison is any more interesting than the absolute figures.

The data reported by DSNY includes three continuous Measures (quantitative fields) for each community district:

`RefuseTonsCollected`

Amount of nonrecyclable garbage

`PaperTonsCollected`

Amount of recyclable paper

`MGPTonsCollected`

Amount of recyclable metal, glass, and plastic

The second two data types (paper and MGP) are both recyclable, and
so can be summed to create a new variable for each community district that
we will call `RecyclableTonsCollected`

to
stay consistent with the provided nomenclature.

A comparison of the ratio of recyclable material to refuse would be interesting to consider, as it would indicate which communities in New York City are doing a better job recycling.

First, we’ll need to create a new calculated field that sums the two
recyclable fields. Right-click anywhere in the lefthand Dimensions or
Measures panels, and select *Create Calculated
Field*. Then, fill out the resulting dialog box, as shown in Figure 4-1.

Next, let’s create a second calculated field that creates the ratio of recyclable material to refuse. Right-click in the Dimensions or Measures area again and fill out the dialog box as shown in Figure 4-2.

Notice that we put the numerator and denominator in brackets,
preceded by `SUM`

. This is critical, because as we move up a level of from community
district to borough, we don’t want to *add* the
individual ratios of each community district; we want to create a new
ratio that represents the quotient of the aggregate amounts. The first
approach would create a larger ratio that is misleading.

Now that we’ve created the ratios, how can we best visualize them?
Let’s start with a simple bar chart of the aggregate ratios of each of the
boroughs. To create one, we’ll drag the `Borough`

field from the Dimensions area on the
left to the Rows shelf, and we’ll drag the `Recycle to Refuse Ratio`

calculated field to the
Columns shelf. Color wouldn’t strictly be needed, but we’ll add it anyway
for aesthetics by dragging `Borough`

to the Color shelf. Lastly, we’ll sort the bars in descending order to put the borough
with the highest recycle ratio at the top. The screen shot in Figure 4-3 shows the resulting view.

We can easily see from this simple bar chart that Manhattan has the highest ratio of recycle to refuse, but that it still only recycles about one ton of material for every four tons of trash. The ratio for Manhattan is about twice as great as the ratio for the Bronx.

Does this mean each community district in Manhattan had a higher recycle ratio than each community district in the Bronx? Not necessarily, as we’re dealing with the terms in the aggregate.

How would we communicate the results at a community district level?
If we just replace `Borough`

with
`CommunityDistrict`

in the Rows shelf,
then we’ll get a bar chart where all the community district #1s will be
aggregated, and all the community district #2s will be aggregated, and so on. Aggregating community districts
in this way isn’t very meaningful.
Instead, we’d like to see community district ratios for each
borough.

To make this comparison, let’s click and drag the `Borough`

pill from the Rows shelf up to the
Columns shelf, and add the `CommunityDistrict`

field from the Measures panel
to the Rows shelf to create a grid of bar charts, as shown in Figure 4-4.

From this view, we can now see that even within boroughs, not all community districts are created equal. Manhattan community district #1 has the highest ratio overall, but it’s clear that some community districts in the Bronx have higher ratios than some in Manhattan.

But what if we wanted to know the exact ratios? The bar charts in Figure 4-4 make it fairly easy to get a general sense of the relative ratios for the community districts, but precision isn’t easily obtained, is it? Can you tell the exact ratio for Manhattan community district #1? I can’t. If knowing the precise amounts is a task our audience will want to perform, we’ll need to think about how to make it easy for them to accomplish that task.

We could increase precision by dragging the `Recycle to Refuse Ratio`

from the Measures area
to the Label shelf (or Ctrl-selecting `AGG(Recycle to Refuse Ratio`

) and dragging it
from the Columns shelf to the Label shelf to duplicate). The
resulting bar chart grid with labels is shown in Figure 4-5.

That’s a lot of decimal places, isn’t it? Our audience probably
doesn’t need to know the ratio to the fourth decimal place
(ten-thousandths). We can change the number of digits shown by
right-clicking the ```
Recycle
to Refuse Ratio
```

in the Measures panel, selecting *Default Properties*, then *Number Format*, and choosing *Number (Custom)*. In the resulting dialog box,
we’ll change the decimal places to three, as shown in Figure 4-6.

Notice that the labels in the bar chart grid change accordingly.

What we really have here is a table that has been augmented by adding bars with lengths proportional to the values in each cell. Another way to show this precise comparison is by getting rid of the bars and instead coloring each cell by the ratio—a highlight table.

Let’s make a highlight table by starting with a new sheet.
Ctrl-click `Borough`

, `CommunityDistrict`

, and `Recycle to Refuse Ratio`

, and then open the
*Show Me* panel and select *highlight table*. Tableau creates the view shown
in Figure 4-7.

Notice that Tableau created a highlight table that’s “flipped”
compared to our bar chart grid shown in Figure 4-5 (`Borough`

is in Rows instead of Columns, and
`CommunityDistrict`

is in Columns instead
of Rows). To flip it to match, we’ll just click the *Swap* button in the toolbar, , resulting in the taller table shown in Figure 4-8.

Tableau has defaulted to a sequential green color palette, meaning
the gradient increases proportionately with increasing `Recycle to Refuse Ratio`

. Our eyes can easily
pick out the darkest cell, and we also find the lightest rather easily.
Most people can distinguish between about six different levels of
intensity. If the goal were to communicate the precise values, we could
have created a simple table without the highlights. Adding the highlights
makes it easier to quickly compare cells and find the highest and lowest
values.

It’s still not immediately obvious which community district has the second-highest ratio, though. Our eyes have to dart around to all the dark cells and read and memorize the values. Eventually we settle on Brooklyn community district #6 as having the second-highest ratio, but the task gets harder as we move along. Which has the third highest? The fourth? Is there an easier way to facilitate this assessment of rank?

What we’d like is a simple list of community districts in descending
order of the ratio. To make this list, let’s first create a new field that
combines the `Borough`

and `CommunityDistrict`

fields into a single
Dimension.

Ctrl-click `Borough`

and `CommunityDistrict`

so they are both selected,
then right-click on one of the two selected fields (either will do fine)
and select *Combine Fields*. A new
string field will appear in the Dimensions area called ```
Borough & CommunityDistrict
(Combined)
```

.

Create the list shown in Figure 4-9 by dragging `Borough & CommunityDistrict`

to the Rows
shelf, `Recycle to Refuse Ratio`

to the
Columns shelf, and `Borough`

to the Color shelf, sorting in
descending order.

Now we can quickly tell which are the top four or five districts, and which are the bottom four or five. Using the colored bars, we also can see some patterns, like that the districts with the lowest ratio are all in the Bronx (blue), and most of the top ten are in Manhattan (green).

But if I asked you to pick out the tenth or the fifteenth districts, you’d have to count down from the top, right? Not the easiest task to perform. How can we add rank to this list to easily pick out the order from 1 to 59?

Let’s create a `Rank`

field. To
do so, right-click anywhere in the Dimensions or Measures panel
on the left and select *Create a Calculated
Field* once again. This time, name the field `Rank`

and enter `INDEX()`

. As soon
as the message appears in the bottom left indicating “the calculation is
valid,” blue text will also appear in the top right that says “Default
Table Calculation.” Click on this blue text and change the *Compute using* drop-down to *Borough & CommunityDistrict (Combined)*,
as shown in Figure 4-10.

This tells Tableau that the rank should apply to the combination
of `Borough`

and `CommunityDistrict`

.

After clicking *OK* twice, we
now see a new field in the Measures panel called `Rank`

. Notice that this field is continuous
(the `#`

symbol is green), but we can consider the rankings to be discrete integers,
because there is no rank between any two successive ranks. To change the
`Rank`

data field type, right-click on
it in the Measures panel and select *Convert to
Discrete*. The `#`

symbol
next to `Rank`

will turn blue, indicating that it’s discrete. We’re now ready
to add `Rank`

to the chart.

To add the `Rank`

field to the
bar chart, simply drag it to the left of `Borough & CommunityDistrict (Combined)`

on
the Rows shelf. After removing the row dividers, we have the colored, sorted, and now ranked bar chart
shown in Figure 4-11.

Alternatively, since the launch of Version 8.1, users can create a
similar `Rank`

field by making use of
the new Rank Table Calculation. To do so, drag another instance of
`Recycle to Refuse Ratio`

to the
columns shelf, click in the down arrow of the new pill, and change it to
*Discrete*, and then click the down
arrow again and select *Quick Table
Calculation* and then *Rank*. Finally, click the down arrow once more
and select *Compute using*, then
choose *Borough CommunityDistrict
(combined)*. (This calcuated field also lets you handle ties
in different ways.)

Now the only task this chart doesn’t facilitate is gleaning
precise values. We could add the labels to the right of the bars as we
did in Figure 4-5 by simply
dragging the `Ratio`

field to the Label
shelf, but perhaps a better approach would be to add the labels to the
left of the bars so as not to interfere with the perception of the bars’
lengths.

Because `Recycle to Refuse Ratio`

is a continuous data field (the `#`

symbol is green), if we just drag it out onto the Rows shelf to the
right of ```
Borough &
CommunityDistrict
```

, Tableau will create 59 y-axes for each of
the bars. Try it and see what I mean.

We just want the ratio value itself to appear, so we’ll first
convert `Recycle to Refuse Ratio`

to a
discrete field the same way we converted `Rank`

, and then we’ll drag the ratio onto the
Rows shelf to the far right, leaving a blue pill and creating the
updated view shown in Figure 4-12.

We’ve now visualized the ratio in a way that makes it easy for our audience to perform a number of different tasks very quickly:

Get a general sense of how the 59 community districts compare relative to one another

Determine which districts have the highest ratios, and which have the lowest

Get a general sense of which boroughs have districts at the top of the list, and which have districts at the bottom

Know the precise ratio for any district

Pick out any district in rank order (e.g., the 12

^{th}or the 28^{th})

This view is a very versatile and useful way to communicate the
ratio of recycle to refuse. But it doesn’t tell us which community
districts produced more trash overall *per person*.
For all we know, the districts at the top may be recycling a lot, but
they may also be producing a much higher amount per person. That’s where
rates come into play.

Recall that a rate is just a special type of ratio in which the numerator and the denominator have different units. In the previous example, the ratio of recycle to refuse involved terms with the same units—namely, tons. When we consider the trash production rate per person, our numerator is still tons, but now our denominator is the number of people.

But the data set we have been working with so far doesn’t include community district population. In order to determine trash production per person, we’ll need to find a way to include population data in our analysis.

Going back to the data discovery process in Figure 1-1, we’ve gone all the way around the horse track, and now we have a brand-new question that involves gathering new data.

Ideally, we would find population by community district for September 2011, which is when the DSNY collection data was taken. We can’t find that, but we came come close: 2010 census data puts us within a year’s time of the trash collection data, and is probably suitable to get a reasonable approximation of the trash production rate.

Now that we found population data, how can we create rates with it? Luckily, Tableau allows us to connect our current workbook to this new data set and “blend” it with the existing data in the workbook.

From the *Data* menu of our current workbook, select *Connect to Data* and then *Microsoft Excel*, and browse to the population
data file, connecting to the correct sheet in the file and importing the
data as an extract. We now see the new data source appear in the Data area along
with the DSNY data, as shown in Figure 4-13.

The newly imported data set is highlighted in the Data area, and the fields for that data set are shown in the Dimensions and Measures panels below. If we click on the DSNY data set, the data fields for that data set will show below.

Notice that `CommunityDistrict`

is in the Dimensions panel of the DSNY set, but it’s in the Measures
area of the new population data source. Let’s match them up by dragging
`CommunityDistrict`

from the Measures
panel of the population table into the Dimensions area.

Because both data sources now have a Dimension called `Borough`

and a Dimension called `CommunityDistrict`

, Tableau will link them
together for us. If the fields to link had different names, we would
have to manually link them using *Data* → *Edit
Relationships*.

Now that we have linked these two data sources, our updated Tableau workbook contains both terms of the refuse per person ratio: the measurement of the amount of refuse in tons (the numerator) and the count of the number of people in each community district (the denominator). We’ll create the rate the same way we created the ratio in the previous section, but now we need to create a calculated field that includes fields from two different sources, as shown in Figure 4-14.

Using the `SUM`

aggregation
type for both terms in the rate means that we can get
meaningful rates for boroughs as well as community districts. Also,
notice we are converting from short tons (US) to pounds (lbs.) by
multiplying the numerator by 2,000—which is how many pounds are in each
ton. On a per person basis, it’s much easier for us to think in terms of
pounds, so this is a better unit to use for this comparison.

Now that we have created the rate, we can visualize it. We’ll follow the same steps that we used to create the ratio bar chart in Figure 4-12 to create the rate comparison bar chart shown in Figure 4-15.

It’s clear from this view that the three community districts in Staten Island produced the most refuse per person in September 2011, at a rate of more than 70 pounds per person.

A word of caution about these results: often, data visualization and data analysis are best used to propose new questions to ask. Data is great at helping us make comparisons, but it doesn’t always answer “why?” and “how?” It can be easy for us to jump to conclusions based on what the data shows, but some further qualitative investigation is frequently required.

Why are the Staten Island community districts at the top of this
chart? The chart itself doesn’t tell us. It helps us formulate the next
question. Does this mean every person who lives on Staten Island
produced 70 to 80 pounds of trash that month? No, that’s not what the
data shows. The data just shows how much trash was collected from their
communities. DSNY provided *overall* refuse
collected, not just for residences, so perhaps there are industries or
businesses on Staten Island that produced a lot of trash.

The point is to be careful with what we communicate, and not to communicate more than the data gives us license to say.

In this chapter, we learned how to use Calculated Fields and blended data sources to create ratios (quotients) and rates (quotients with mixed units). We also learned the Index function as well as the Rank table calculation, and created highlight tables and bar charts to compare different dimensions—at both the borough and community district levels. This enabled us to go beyond overall amounts (“how much”) and to consider relative amounts (“how much per”).

In the next chapter, we’ll consider another type of normalized comparison: proportions and percentages.

Start Free Trial

No credit card required