Chapter 1. Introduction
It is estimated that 70%–80% of job postings for a data analyst mention statistics as a desired skill or requirement. I haven’t found a way to verify those numbers myself, but from the job postings I have seen, I would argue in favor of that estimate. With ever-increasing amounts of data, businesses are looking for ways to interpret and understand that data, and statistics is often the most scientific way to do so. However, I think many analysts and Tableau developers struggle to incorporate statistics into their analyses and data visualizations. There are many reasons for this, and I will be the first to tell you that it is not for lack of trying. Statistics can be intimidating for both developers and the stakeholders who rely on their reports, and explaining and interpreting complex statistical equations is tough without a firm understanding of the discipline.
That is the exact purpose of this book. I want to equip you with that firm understanding of statistics and give you the confidence to speak to the equations and implement them in your work. In this book, I will be focusing on bringing data visualization in Tableau together with statistical analysis so that you can support your insights with scientific evidence.
In this chapter, I will introduce you to some common Tableau terminology I will be using throughout the book. I am also going to introduce you to some basic statistical terms and ideas. Toward the end of the chapter, I will present you with a case study that ties both disciplines together, and I will discuss the importance of visualizing statistical results.
Introduction to Tableau
It is important to understand that Tableau is not simply a data visualization tool, but a company with a suite of tools to support data visualization and analytics at an enterprise level. There are many products within Tableau’s ecosystem, including Tableau Desktop, Tableau Cloud, Tableau Server, Tableau Prep Builder, Tableau Public, and more.
Some of these products require a license to use, while others, such as Tableau Public, do not require a purchased license but come with certain limitations. With a license, you can publish your workbooks to Tableau Server or Tableau Cloud from Tableau Desktop, which allows your users to view and interact with your data visualizations from a browser. Check out the Tableau website for a full list of all Tableau’s products.
Common Terms of the Authoring Interface of Tableau Desktop
There are several common terms within Tableau Desktop that I want you to know and be familiar with. To begin with, when you open Tableau Desktop, you will land on the Start Page, as shown in Figure 1-1.
From the Start Page, you can connect to the data you want to visualize. Tableau has hundreds of connectors that you can use to access your data. A connector is basically like a built-in API that allows you to establish a connection to a database or file type to read that data into Tableau Desktop. On the lefthand side of the Start Page you can explore all the connectors that are available.
For all the demonstrations in this book, I will be using the Sample - Superstore dataset. To connect to this dataset, simply click on Sample - Superstore, as shown in Figure 1-2.
It is important to note that if you are using a different version of Tableau Desktop than I am, you may get different results. Tableau will occasionally update the Sample - Superstore dataset. I will be using version 2023.2 throughout this book. If you want to follow along exactly, you can download this version from Tableau’s product support page.
After clicking on the sample dataset, you will be navigated from the Start Page to Tableau Desktop’s authoring interface, as shown in Figure 1-3.
To introduce you to the terms I will use throughout the book, on the lefthand side, you will find the Data pane, as shown in Figure 1-4.
At the top of the Data pane, you will see a list of the data sources you are connected to. Moving down, you will find a list of fields, including those that are calculated, separated by data source and whether Tableau believes that field is a measure or dimension.
To the right of the Data pane, you will find the different components used to create visualizations, called shelves. There is the Marks shelf, Filters shelf, Pages shelf, Columns shelf, and Rows shelf, along with the canvas, as shown in Figure 1-5.
Here is a brief explanation of each:
- Marks shelf
The Marks shelf is a key element of the authoring interface and allows you to drag fields onto different properties that affect the view. The properties are Color, Size, Text, Detail, and Tooltip. Additional properties appear when certain conditions are met. For instance, changing the mark type to Pie adds an Angle property to the Marks shelf.
- Filters shelf
The Filters shelf allows you to add different fields to filter the view on. There are eight different types of filters in Tableau that are processed at different times in Tableau’s order of operations.
- Pages shelf
The Pages shelf lets you break the view up into pages so that you can analyze how a specific field affects the rest of the fields in the view. The most common use of this is to add a Date dimension and animate how things change over time.
- Columns shelf
The Columns shelf is where you can drag fields to create the columns of the visualization you are making. The Columns shelf will coordinate with the x-axis in the view.
- Rows shelf
The Rows shelf is where you can drag fields to create the rows of the visualization you are making. The Rows shelf coordinates with the y-axis in the view.
- Canvas
The canvas is where the data visualization will appear as you begin dragging fields to the various shelves. You can also drag fields directly onto the canvas while authoring a data visualization; doing so adds the field to the appropriate shelf for you.
The last major feature I want to call out in this chapter is in the bottom-left corner of the authoring interface. There you will find a button to navigate to the data source page and three additional buttons. These buttons are used to create new worksheets, new dashboards, or new stories, as shown in Figure 1-6.
To give you a little more context, here is a brief description of each:
- Data Source button
This will navigate you to the Data Source page. From there, you can add new connections, create new data sources, and view the physical and logical layers used for joins and relationships.
- New worksheet button
Clicking this button will create a new worksheet and navigate you to that sheet’s tab. From here you can author a new data visualization.
- New dashboard button
Selecting this button will create a new dashboard and navigate you to that dashboard’s tab. From here you can drag sheets onto the canvas instead of fields to compile a new dashboard.
- New story button
Clicking the new story button will create a new story and navigate you to that story’s tab. From here, you can compile a story using sheets or dashboards to create different pages within your story.
Example of the Step-by-Step Instructions Throughout This Book
To get you familiar with the instructions and writing style used in this book, this section gives a simple example that puts the common terms together. Using Tableau Desktop is very intuitive, and there are many different ways to do things. I am going to show you how to create two simple charts and add them to a dashboard using the Sample - Superstore dataset. Let’s say you want to view the sales by order date. First, double-click on Sales in the Data pane, then double-click Order Date, as shown in Figure 1-7.
Tableau is intuitive enough to recognize that you likely want this data to trend over time, and it will automatically create a line chart, as shown in Figure 1-8.
Now let’s say you also want to view your sales data by segment. Click on the “New worksheet” button at the bottom left of the authoring interface, as shown in Figure 1-9.
This will open Sheet 2; your first chart is still viewable by navigating back to Sheet 1. Double-click on Sales, then Segment in the Data pane, as shown in Figure 1-10.
This will create a simple bar chart showing the SUM(Sales) (sum of sales) by Segment on the canvas, similar to Figure 1-11.
So far, you’ve been able to view these two charts in a working environment in Tableau. Let’s say you want to share these charts with others in your organization. To begin that process, click on the “New dashboard” button in the bottom left of the authoring interface, as shown in Figure 1-12.
This will open a new canvas where you can create dashboards, as shown in Figure 1-13. Dashboards are the bread and butter of Tableau and are ultimately what you will share for users to interact with.
Now add your two sheets on the dashboard canvas. On the left, click and drag Sheet 1 onto the canvas. Then click and drag Sheet 2 onto the canvas. Your dashboard should now look similar to Figure 1-14.
This example should help you see how Tableau’s common terms will be used in the tutorials throughout this book. Knowing the layout of the tool and terms is the foundation for understanding Tableau Desktop as a whole. The content thus far has most likely been a review for you. From here on, I will be showing you how to tie statistics into your dashboards and giving you tangible examples of how to implement them into your work! In the next section, I will be introducing you to common statistical terms and showing you an example that ties everything together.
Introduction to Statistics
According to Merriam-Webster’s online dictionary, statistics is defined as a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data. I personally think this definition hits the nail on the head, especially in today’s business environment. To unlock deep insights in your data, you need to incorporate statistics into almost every aspect of the analytics process. This includes collecting data in an efficient and ethical way, understanding the data, finding deeper insights in the analysis, and presenting your findings so your stakeholders can make informed decisions.
In the next section, I will introduce you to some common statistical terms and ideas. I will also show you how powerful adding statistics to your analysis can be through a tangible case study example.
Common Statistical Terms
To level-set, I will briefly explain some of these terms and ideas. However, this is not a comprehensive list of everything there is to know about statistics. The purpose of this book is to get you more comfortable and familiar with foundational statistics so you can apply them to your own work. I will also go into more detail about some of these terms as you progress through each chapter, where applicable:
- A statistic
Throughout this book, you will see me refer to different things as a statistic. The definition of a statistic is a fact or piece of data from a study of a large quantity of numerical data. This means anything you can calculate from a large set of data could be referred to as a statistic. For instance, if we calculate the mean, median, or mode of a dataset, I would refer to each of those values as a statistic.
- Hypothesis testing
Setting up a hypothesis test is one of the most foundational steps in most statistical analyses. Without doing so, you will find yourself chasing statistical significance, when simply showing there isn’t a significant difference can be just as powerful. Basically, a hypothesis test is when you create a null hypothesis and an alternative hypothesis. Then you lay out the conditions of what you consider a significant difference in favor of one hypothesis or the other by setting a significance level.
- Significance level
A significance level is the predetermined threshold used to determine statistical significance. The most common significance level is 0.05 (5%), but this is simply a convention; there are times when you will want a stricter or looser threshold. For instance, if you are in healthcare, you may want a lower significance level (such as 0.01) to reduce the risk of declaring an effect that isn’t real.
- Statistical significance
Statistical significance is a term used in statistics to determine whether an observed effect or relationship in data is likely to be genuine or if it could have occurred by chance. In other words, it helps analysts assess whether the results of an analysis are meaningful or if they could be attributed to random variation.
To tie that back to the hypothesis tests:
- P-value
The p-value (probability value) is a measure of the evidence against a null hypothesis. It represents the probability of obtaining the observed results (or more extreme results) if the null hypothesis is true. A low p-value (typically less than 0.05) is considered indicative of statistical significance.
If the p-value is less than the chosen significance level (commonly 0.05), the null hypothesis is rejected in favor of the alternative hypothesis. If the p-value is greater than the significance level, there is not enough evidence to reject the null hypothesis.
In summary, you define what you are going to test by setting up a null hypothesis and alternative hypothesis. Then you decide what the significance level should be for your experiment. Then you test for a statistically significant difference and use the p-value as a unit of measure compared to your predetermined significance level.
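The decision step described above can be sketched as a tiny helper function. This is purely illustrative; the function name and return strings are my own:

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value to a predetermined significance level (alpha)."""
    if p_value < alpha:
        return "reject the null hypothesis"        # statistically significant
    return "fail to reject the null hypothesis"    # not statistically significant
```

For example, `decide(0.03)` returns `"reject the null hypothesis"`, while `decide(0.06)` does not, because 0.06 exceeds the 0.05 threshold.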
To really drive these ideas home, I want to show you a practical example we can calculate by hand. This way, you can see how these terms all come together.
Practical Application Through a Case Study
Let’s say that your company wants to test some new marketing in an email. However, they are worried that if the new marketing fails, it could significantly impact sales for this quarter. Therefore, they want to test the new marketing email by sending it to a subset of the total email list, then analyze the performance before deciding whether to move forward with the new marketing. Table 1-1 shows the results of the test displayed in a contingency table.
|                | Original email | New marketing email |
|----------------|----------------|---------------------|
| Nonconversions | 727            | 117                 |
| Conversions    | 23             | 8                   |
A contingency table is a way to organize and display data in a table format, especially when studying the relationship between two categorical variables. Categorical variables are variables that represent categories or groups, such as colors, types of fruits, or responses to a yes-no question. In this example, we are displaying how many conversions the original email had compared to the new marketing email. Conversion has been defined as “the point at which a recipient of a marketing message performs a desired action.”1
The marketing team has done a simple analysis of the conversion rates by dividing each campaign’s conversions by its total emails sent. Using this calculation, they found that the original email had a conversion rate of about 3% (23 ÷ 750 = 0.030) and the new marketing email had a conversion rate of about 6% (8 ÷ 125 = 0.064). They claim that the new email is an absolute success and that it will lead to double the number of conversions when they send it out to their entire list next time.
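As a quick sanity check, the marketing team’s arithmetic can be reproduced in a few lines. This is a sketch in Python; the variable names are mine:

```python
# Figures from Table 1-1: conversions and total emails sent per campaign.
original_conversions, original_sent = 23, 750
new_conversions, new_sent = 8, 125

original_rate = original_conversions / original_sent  # about 3%
new_rate = new_conversions / new_sent                 # about 6%

print(f"original: {original_rate:.3f}, new: {new_rate:.3f}")
```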
Senior leaders at the business are thrilled with the idea of doubling the number of sales and want to invest in several new salespeople to help with the increase. However, they come to you for a second opinion and ask the analytics team to review the data and confirm the marketing team’s assumptions.
Where do you begin? This is where statistical analysis will become your best friend. Armed with some basic statistics, you know that you can run a few simple tests to tell you if the new marketing email was statistically significant or not. Before I get too far in the weeds, let’s set up the hypothesis and determine the significance level to test for.
Setting up the hypothesis test
The first thing you need to do in this situation is to set up a hypothesis test. In a standard hypothesis test, you set the two hypotheses: null and alternative. For this example the hypothesis will be as follows:
- Null hypothesis
The new marketing email is not statistically significant; therefore, email conversions will remain the same on average as the original.
- Alternative hypothesis
The new marketing email is statistically significant; therefore, email conversions will be higher on average than the original.
To prove the statistical significance, I will be looking for a p-value less than 0.05, which is my significance level.
In statistics, it’s important to understand that you are always trying to validate your assumptions using mathematics. What do I mean by that? You always want to assume that the results aren’t going to change when new things are introduced. Therefore, you want to assume that the null hypothesis is correct, and your test will determine if that is wrong. In statistics, you would say you have failed to reject the null hypothesis if the p-value is greater than your predetermined significance level. If the p-value is less than the significance level, then the test is statistically significant, and you would reject the null hypothesis in favor of the alternative.
Chi-square test
Now that you have your hypothesis set up, it’s time to run a statistical analysis. In the spirit of providing you with a foundational understanding, I have decided to run a simple statistical test called a chi-square test. A chi-square test is a statistical test used to determine whether there is a significant association (or independence) between two categorical variables. It is particularly useful when working with data that can be organized into a contingency table.
This is a great option to run in this situation and very accessible, even if you’re new to statistics. You don’t have to have any special software or know any coding to calculate this test. You can do it by hand, run it in Excel, or look for a calculator online.
To begin, let’s revisit the contingency table and add to it. As you can see in Table 1-2, I added totals for each row and column, plus a grand total.
|                | Original email | New marketing email | Totals |
|----------------|----------------|---------------------|--------|
| Nonconversions | 727            | 117                 | 844    |
| Conversions    | 23             | 8                   | 31     |
| Totals         | 750            | 125                 | 875    |
Now you need to calculate expected values (E) for each of the cells in the table. The formula is very easy. Take the row total, multiply it by the column total for each cell, and then divide by the grand total. So for the top-left cell (original email by nonconversions) you would take 750 × 844 ÷ 875 = 723.43. I will calculate each of the expected values in the corresponding cells in Table 1-3.
|                | Original email                | New marketing email           | Totals |
|----------------|-------------------------------|-------------------------------|--------|
| Nonconversions | E11: (750 × 844) ÷ 875 = 723.43 | E12: (125 × 844) ÷ 875 = 120.57 | 844    |
| Conversions    | E21: (750 × 31) ÷ 875 = 26.57   | E22: (125 × 31) ÷ 875 = 4.43    | 31     |
| Totals         | 750                           | 125                           | 875    |
You can see that I added some mathematical syntax for each cell (E11, E12, E21, and E22). This is referring to the expected value for the cell in row x and column y. So E11 is the expected value in row 1/column 1. E12 is the expected value for row 1/column 2, and so on. I will continue to use mathematical expressions and syntax similar to this throughout the book and introduce you to mathematical syntax along the way.
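The expected-value formula lends itself to a one-line matrix computation: an outer product of the row and column totals, divided by the grand total. Here is a minimal sketch using NumPy:

```python
import numpy as np

# Observed counts from Table 1-2.
observed = np.array([[727, 117],
                     [23,   8]])

row_totals = observed.sum(axis=1)   # [844, 31]
col_totals = observed.sum(axis=0)   # [750, 125]
grand_total = observed.sum()        # 875

# E = (row total x column total) / grand total, for every cell at once.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```

The result matches the four expected values in Table 1-3.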
With your expected values calculated, you need to finish by comparing those values to the values you observed. This step is expressed mathematically by the following formula, where O is the observed value and E is the expected value for each cell:

χ² = Σ ((O − E)² ÷ E)
Simply put, for each cell you take the observed value minus the expected value you just calculated, square that difference, and then divide by the expected value. You do this for each cell and then add up the resulting values. Looking at E11, you have the observed value of 727 minus the expected value of 723.43, which equals 3.57. Squaring 3.57 gives 12.7449. Then divide that by the expected value: 12.7449 ÷ 723.43 = 0.017617, which I’ll round to 0.018. You can follow along in Table 1-4 for each cell.
|                | Original email                  | New marketing email             | Totals |
|----------------|---------------------------------|---------------------------------|--------|
| Nonconversions | (727 − 723.43)² ÷ 723.43 = 0.018 | (117 − 120.57)² ÷ 120.57 = 0.106 | 844    |
| Conversions    | (23 − 26.57)² ÷ 26.57 = 0.48     | (8 − 4.43)² ÷ 4.43 = 2.877       | 31     |
| Totals         | 750                             | 125                             | 875    |
Now you take the values you got in each cell in Table 1-4 and add them up. Here are the values we got for each cell:
E11 = 0.018
E12 = 0.106
E21 = 0.48
E22 = 2.877
χ² = 0.018 + 0.106 + 0.48 + 2.877 = 3.481
That gives you a χ² observed value of 3.481. The decision rule for a chi-square test is as follows: if the χ² observed value is greater than the χ² critical value, you reject the null hypothesis. So far, I have calculated the χ² observed value, but I still need the χ² critical value. Remember, for our hypothesis test, we set a significance level of 0.05. Using that significance level, you can determine the χ² critical value.
The best way to find the critical value is to look it up in a distribution table. A distribution table is a resource you can find online that is a large table of precalculated critical values. You also need the degrees of freedom, calculated for a contingency table as (rows − 1) × (columns − 1); for our 2 × 2 table, that is 1. Using the significance level of 0.05 and 1 degree of freedom, I found the χ² critical value to be 3.84.
Since the observed value of 3.481 is not greater than the critical value of 3.84, you fail to reject the null hypothesis. In simple terms, this means the test showed that the new marketing email did not produce a statistically significant increase in conversions. You can conclude that moving forward with this new email marketing campaign will yield similar results to the original, on average.
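In practice, you rarely need to compute a chi-square test by hand. Here is a sketch of the same test using SciPy. Note that `chi2_contingency` applies Yates’ continuity correction to 2 × 2 tables by default, so I pass `correction=False` to match the hand calculation above:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from Table 1-2: rows are nonconversions/conversions,
# columns are original email and new marketing email.
observed = np.array([[727, 117],
                     [23,   8]])

# correction=False disables Yates' continuity correction so the
# statistic matches the hand calculation (about 3.48).
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Critical value at the 0.05 significance level with 1 degree of freedom.
critical_value = chi2.ppf(1 - 0.05, df=dof)

print(f"chi2 = {stat:.3f}, p = {p_value:.3f}, critical = {critical_value:.2f}")
```

The statistic falls below the critical value (and the p-value exceeds 0.05), so the code reaches the same conclusion: fail to reject the null hypothesis.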
Conclusions drawn from statistical analysis
I chose this example for two reasons: (1) this is a common, real-world example that gives you a foundational understanding of statistics and how it’s used, and (2) this example comes really close to being statistically significant. In statistics, one of the most important lessons is to understand the data and make some assumptions.
In this situation, I would go back and report that the results did not show a statistically significant increase in conversions; however, the data suggests there is a slight improvement. My recommendation would be to hold off on hiring, run the test again next quarter, and split the total emails sent closer to 50/50 instead of the roughly 86/14 split used in this test. This would give the team a larger sample size for the new email when rerunning the analysis. After all, while the new campaign did not yield statistically significant evidence that it increased conversions, the results do suggest that the new marketing email did not hurt conversions in any way.
Therefore, it’s not always as black and white as it appears. Unlike traditional mathematics, when using statistics, you have to be able to think outside the box and make further recommendations after an analysis.
Data Visualization and Statistics
In closing, data visualization has an obvious advantage when you are trying to find quick insights in your data, and from the previous example, you can see the power statistical analysis can have when making decisions. However, bringing them together is where you will truly get the most out of any analytics tool or analysis.
I want to share a great example to drive home the importance of bringing data visualization together with statistical analysis. In Table 1-5, I have four statistical summaries from four different datasets.
|      | Dataset 1 X | Dataset 1 Y | Dataset 2 X | Dataset 2 Y | Dataset 3 X | Dataset 3 Y | Dataset 4 X | Dataset 4 Y |
|------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| Obs  | 11          | 11          | 11          | 11          | 11          | 11          | 11          | 11          |
| Mean | 9.00        | 7.50        | 9.00        | 7.50        | 9.00        | 7.50        | 9.00        | 7.50        |
| SD   | 3.16        | 1.94        | 3.16        | 1.94        | 3.16        | 1.94        | 3.16        | 1.94        |
| r    | 0.82        |             | 0.82        |             | 0.82        |             | 0.82        |             |
Here you can see some statistics for each dataset: the number of observations (Obs), the mean, the standard deviation (SD), and the correlation coefficient (r). I will explain each of these statistics in detail in upcoming chapters; for now, notice that they are the same across all four datasets. Yet if you were to plot the datasets and visualize them, as shown in Figure 1-15, you could clearly see that each dataset is very different.
Figure 1-15 is an example of Anscombe’s quartet, and it was constructed by the statistician Francis Anscombe in 1973 to demonstrate the importance of visualizing your data before and after modeling it. When building statistical models, you need to visualize the data to truly understand what the story is—if there are outliers, correlation, normalization; the list goes on. On the other hand, data visualization alone leaves a lot of assumptions and room for misinterpretation, so you need to back it up with statistics. The rest of this book will be about just that.
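You can verify the summary statistics in Table 1-5 yourself from Anscombe’s published values. Here is a sketch in Python; note that `np.std` defaults to the population standard deviation (ddof=0), which is what produces the 3.16 and 1.94 in the table:

```python
import numpy as np

def summarize(x, y):
    """Return (n, mean x, mean y, SD x, SD y, r), rounded as in Table 1-5."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    return (len(x), round(x.mean(), 2), round(y.mean(), 2),
            round(x.std(), 2), round(y.std(), 2), round(r, 2))

# Anscombe's (1973) four datasets; the first three share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

for x, y in datasets:
    print(summarize(x, y))  # identical summaries for four very different datasets
```

All four datasets print the same rounded summary, which is exactly Anscombe’s point: summary statistics alone can hide radically different shapes.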
Summary
In this chapter, I discussed what Tableau is and listed several of its key products. Then I went over some key terms that I will use throughout the book when walking you through each tutorial. This foundational knowledge will be key in later chapters, especially if you are newer to Tableau.
Then I touched on some foundational statistical terms and ideas. After that, I tied those terms together with a practical case study. To introduce you to the idea of how statistics and data visualization come together, I showed you the Anscombe’s quartet example.
In the following chapters, I will show you how to start incorporating statistical analysis into your data visualizations in Tableau. You will learn to visualize distribution of your data, detect outliers, forecast future values, create a cluster analysis, use regression to make predictions, and connect to external resources for more advanced statistical models.
If you have made it this far and still need some additional foundational practice, I would recommend the following books to become more familiar with Tableau Desktop and its capabilities:
- Practical Tableau by Ryan Sleeper (O’Reilly, 2018)
- Tableau Desktop Cookbook by Lorna Brown (O’Reilly, 2021)
- Tableau Strategies by Ann Jackson and Luke Stanke (O’Reilly, 2021)
1 See David Kirkpatrick’s blog article on conversion, “Marketing 101: What Is Conversion?,” MarketingSherpa, March 15, 2021.