Chapter 4. Working with Sets
Although you can interact with your data one row at a time, relational databases are really all about sets. This chapter explores the use of set operators, which allow you to combine data from multiple result sets. I’ll start with a quick overview of set theory and then move on to show how the set operators union
, intersect
, and except
can be used to blend multiple data sets together.
Set Theory Primer
You may recall from an early math class seeing diagrams such as the one shown in Figure 4-1.
The shaded area in Figure 4-1 represents the union of sets A and B, with the overlapping area included just once. Here’s another way of illustrating the union
operation, using two sets of integers:
A = {1, 2, 4, 7, 9} B = {3, 5, 7, 9} A union B = {1, 2, 3, 4, 5, 7, 9}
In this example, set A contains 5 integers, and set B contains 4 integers. Each set has a couple of unique values, but both sets share the values 7 and 9. The union
of A and B yields a total of 7 values, with the integers 7 and 9 included just once.
Next, let’s look at the area shared by two sets, known as the intersection. Figure 4-2 shows the graphical depiction of an intersection.
Using the same sets of integers as the last example, you can see that the intersection of A and B contains only the numbers 7 and 9:
A = {1, 2, 4, 7, 9} B = {3, 5, 7, 9} A intersect B = {7, 9}
If there is no overlap between the two sets, the intersection would be empty, and intersect
is the only set operator that can yield an empty set.
The third and last diagram shown in Figure 4-3 illustrates an operation where the contents of one set is returned minus any overlap with another set.
Known as the except
operation (but you may also use minus
interchangeably), this operation yields all of set A without any overlap with set B. Here’s the same operation using the numeric sets:
A = {1, 2, 4, 7, 9} B = {3, 5, 7, 9} A except B = {1, 2, 4}
Of the three set operations, except
is the only one that yields a different result if you switch the order of the sets. Here’s the same example, but showing the result of B except
A:
A = {1, 2, 4, 7, 9} B = {3, 5, 7, 9} B except A = {3, 5}
These three set operators cover many situations, but how might you describe the case shown in Figure 4-4?
The operation illustrated in Figure 4-4 cannot be achieved using one of the three set operators described previously. It can be achieved, however, by combining multiple set operators. Here are two different ways to generate the shaded area shown in Figure 4-4:
(A union B) except (A intersect B) -- or -- (A except B) union (B except A)
The next sections in this chapter will show you how to put these set operators to work.
The union Operator
The union
operator allows you to combine two data sets. Here’s a simple example:
PUBLIC>select 1 as numeric_col, 'ABC' as string_col union select 2 as numeric_col, 'XYZ' as string_col; +-------------+------------+ | NUMERIC_COL | STRING_COL | |-------------+------------| | 1 | ABC | | 2 | XYZ | +-------------+------------+
This is about as simple as it gets: two single-row result sets, without any overlapping values, combining to make a two-row result set. This type of statement is known as a compound query because it contains two independent queries separated by a set operator. Before moving to more advanced examples, here are some restrictions related to the two sets being combined:
- Both sets must contain the same number of columns.
- The data types of the columns must match.
Here’s how Snowflake responds if the two sets have a different column count:
PUBLIC>select 1 as numeric_col, 'ABC' as string_col union select 2 as numeric_col, 'XYZ' as string_col, 99 as extra_col; 001789 (42601): SQL compilation error: invalid number of result columns for set operator ...
In this case, you will see a very specific error message from Snowflake, but here’s the error thrown if the column types don’t match:
PUBLIC>select 1 as numeric_col, 'ABC' as string_col union select 'XYZ' as numeric_col, 2 as string_col; 100038 (22018): Numeric value 'ABC' is not recognized
This error message is more general in nature, but it tells you that given the data type used in the first column of the first query (the number 1), the value 'ABC'
does not match.
As mentioned earlier, the union
operator removes duplicates when constructing the result set, meaning that any overlaps are represented only once. Here are the same two sets used earlier ({1, 2, 4, 7, 9} and {3, 5, 7, 9}) built using the values
subclause and then combined using union
:
PUBLIC>select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) union select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 1 | | 2 | | 4 | | 7 | | 9 | | 3 | | 5 | +-------------+
As you can see, both sets include 7 and 9, but these values only appear once in the result set because the union
operator sorts the values and removes duplicates.
While this is the default behavior, there are cases when you don’t want the duplicates to be removed, in which case you can use union all
:
PUBLIC>select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) union all select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 1 | | 2 | | 4 | | 7 | | 9 | | 3 | | 5 | | 7 | | 9 | +-------------+
The result set now includes nine rows instead of seven, with the duplicate 7s and 9s making up the extra two rows.
The sample database includes data from 1992 through 1998; here’s a query that uses union
to find all customers who placed orders greater than $350,000 in 1992 or 1993:
PUBLIC>select distinct o_custkey from orders where o_totalprice > 350000 and date_part(year, o_orderdate) = 1992 union select distinct o_custkey from orders where o_totalprice > 350000 and date_part(year, o_orderdate) = 1993; +-----------+ | O_CUSTKEY | |-----------| | 114856 | | 116683 | | 4334 | | 61790 | | 24275 | | 20665 | | 130745 | | 70054 | | 122047 | | 123934 | ... <847 rows omitted> | 102304 | | 106813 | | 109519 | | 120437 | | 131131 | | 130987 | | 108745 | | 5330 | | 95137 | | 59665 | +-----------+
This query could have been written with a single filter condition and without the union
operator, but this form of the query will be useful to compare the results of intersect
and except
in the next two sections.
The intersect Operator
As described earlier in the chapter (see “Set Theory Primer”), the intersect
operator is used to find the intersection between two data sets. Here’s an example using the two numeric data sets from earlier:
PUBLIC>select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) intersect select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 7 | | 9 | +-------------+
The overlap of sets A and B are the values 7 and 9; all other values are unique to each set.
In the previous section, the query against the Orders
table using union
returned 867 rows. Let’s see what happens when union
is changed to intersect
for the same query:
PUBLIC>select distinct o_custkey from orders where o_totalprice > 350000 and date_part(year, o_orderdate) = 1992 intersect select distinct o_custkey from orders where o_totalprice > 350000 and date_part(year, o_orderdate) = 1993; +-----------+ | O_CUSTKEY | |-----------| | 100510 | +-----------+
While there are 867 different customers who placed orders over $350,000 in 1992 or 1993, only one of them placed such orders in both years.
The except Operator
The third and last of the set operators is except
, which is used to return rows from set A without any overlap with set B. Here’s an example using the two numeric data sets from earlier:
PUBLIC>select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) except select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val);
+-------------+ | INTEGER_VAL | |-------------| | 1 | | 2 | | 4 | +-------------+
This operation yields the set {1, 2, 4}, which are the three values found in set A but not in set B. Switching the order of the two sets yields a different result:
PUBLIC>select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val) except select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 3 | | 5 | +-------------+
Unlike union
and intersect
, which will yield the same results regardless of which set is above or below the set operator, keep in mind that A except
B will usually yield different results than B except
A.
Returning to the Orders
query, here are the results when using except
:
PUBLIC>select distinct o_custkey from orders where o_totalprice > 350000 and date_part(year, o_orderdate) = 1992 except select distinct o_custkey from orders where o_totalprice > 350000 and date_part(year, o_orderdate) = 1993; +-----------+ | O_CUSTKEY | |-----------| | 134878 | | 125183 | | 44240 | | 79138 | | 71119 | ... <444 rows omitted> | 74929 | | 149240 | | 23587 | | 79999 | | 46490 | +-----------+
This query returns 454 customers and when the order of the sets is reversed (orders from 1993 first), the query returns 412 rows.
Set Operation Rules
The following sections outline some rules to keep in mind when working with set operators.
Sorting Compound Query Results
If you want your results to be sorted when using set operators, you need to abide by the following rules:
- There may only be one
order by
clause, and it must be at the end of the statement. - When specifying column names in the
order by
clause, you must use the column names/aliases from the first query.
Regarding the second point, the individual queries used as part of a compound statement may retrieve data from different tables having different column names, so the column names used in the first query determine what can be specified in the order by
clause. Here’s an example:
PUBLIC>select distinct o_orderdate from orders intersect select distinct l_shipdate from lineitem order by o_orderdate; +-------------+ | O_ORDERDATE | |-------------| | 1992-01-03 | | 1992-01-04 | | 1992-01-05 | | 1992-01-06 | | 1992-01-07 | ... <2,394 rows omitted> | 1998-07-29 | | 1998-07-30 | | 1998-07-31 | | 1998-08-01 | | 1998-08-02 | +-------------+
In this example, the column used for sorting must be by o_orderdate
rather than l_shipdate
; here’s what happens if you specify l_shipdate
as the sort column:
PUBLIC>select distinct o_orderdate from orders intersect select distinct l_shipdate from lineitem order by l_shipdate; 000904 (42000): SQL compilation error: error line 4 at position 9 invalid identifier 'L_SHIPDATE'
To avoid confusion in these cases, you can specify matching column aliases for both queries and use the alias in your order by
clause.
Set Operation Precedence
If your compound query contains more than two queries using different set operators, you need to consider the order in which to place the queries in order to achieve the desired results. Figure 4-5 shows the “mystery operation” introduced earlier as Figure 4-4.
Let’s see how this could be accomplished using the two simple numeric sets used throughout this chapter:
PUBLIC>select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 1 | | 2 | | 4 | | 7 | | 9 | +-------------+ PUBLIC>select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 3 | | 5 | | 7 | | 9 | +-------------+
As mentioned earlier in the chapter, there are multiple ways to generate the shaded area in Figure 4-5, which in this case is the set {1, 2, 3, 4, 5}, but let’s choose the second one:
(A except B) union (B except A)
Here’s how that might be constructed for the two numeric sets:
PUBLIC>select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) except select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val) union select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val) except select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val); +-------------+ | INTEGER_VAL | |-------------| | 3 | | 5 | +-------------+
All the pieces seem to be in place and in the right order, but the results don’t seem to be correct. The issue here is that the server is applying these operations from the top down, whereas these operations need to be grouped together. Specifically, the two except
operations need to be done independently, and the two resulting sets can then be joined together using union
. To do so, you must use parentheses:
PUBLIC>(select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) except select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val) ) union (select integer_val from (values (3), (5), (7), (9)) as set_b (integer_val) except select integer_val from (values (1), (2), (4), (7), (9)) as set_a (integer_val) ); +-------------+ | INTEGER_VAL | |-------------| | 1 | | 2 | | 4 | | 5 | | 3 | +-------------+
When building compound queries with three or more set operators, make sure you consider how the different operations may need to be grouped in order to achieve the desired results.
Wrap-Up
In this chapter, you learned how the set operators union
, intersect
, and except
can be used to identify overlaps between multiple data sets. You were also introduced to strategies for combining these operations in order to identify complex relationships between data sets. While you may not see these set operators used in many examples in later chapters, keep these powerful tools in mind when performing data analysis or crafting reports.
Test Your Knowledge
The following exercises are designed to test your understanding of set operators. Please see “Chapter 4” in Appendix B for solutions.
Exercise 4-1
Given the following two sets:
A = {3, 5, 7, 9} B = {4, 5, 6, 7, 8}
Which sets are generated by each of the following operations?
- A union B
- A union all B
- A intersect B
- A except B
- B except A
Exercise 4-2
Write a compound query that returns the names of all regions (Region.r_name
) and nations (Nation.n_name
) that start with the letter A
.
Exercise 4-3
Modify the query from Exercise 4-2 to sort by name (the default sort is fine).
Get Learning Snowflake SQL and Scripting now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.