Chapter 4. Working with Sets

Although you can interact with your data one row at a time, relational databases are really all about sets. This chapter explores the use of set operators, which allow you to combine data from multiple result sets. I’ll start with a quick overview of set theory and then move on to show how the set operators union, intersect, and except can be used to blend multiple data sets together.

Set Theory Primer

You may recall from an early math class seeing diagrams such as the one shown in Figure 4-1.

Figure 4-1. Union of two sets

The shaded area in Figure 4-1 represents the union of sets A and B, with the overlapping area included just once. Here’s another way of illustrating the union operation, using two sets of  integers:

A = {1, 2, 4, 7, 9}
B = {3, 5, 7, 9}
A union B = {1, 2, 3, 4, 5, 7, 9}

In this example, set A contains 5 integers, and set B contains 4 integers. Each set has a couple of unique values, but both sets share the values 7 and 9. The union of A and B yields a total of 7 values, with the integers 7 and 9 included just once.

Next, let’s look at the area shared by two sets, known as the intersection. Figure 4-2 shows the graphical depiction of an intersection.

Figure 4-2. Intersection of two sets

Using the same sets of integers as the last example, you can see that the intersection of A and B contains only the numbers 7 and 9:

A = {1, 2, 4, 7, 9}
B = {3, 5, 7, 9}
A intersect B = {7, 9}

If there is no overlap between the two sets, the intersection would be empty, and intersect is the only set operator that can yield an empty set.

The third and last diagram shown in Figure 4-3 illustrates an operation where the contents of one set is returned minus any overlap with another set.

Figure 4-3. Minus of two sets

Known as the except operation (but you may also use minus interchangeably), this operation yields all of set A without any overlap with set B. Here’s the same operation using the numeric sets:

A = {1, 2, 4, 7, 9}
B = {3, 5, 7, 9}
A except B = {1, 2, 4}

Of the three set operations, except is the only one that yields a different result if you switch the order of the sets. Here’s the same example, but showing the result of B except A:

A = {1, 2, 4, 7, 9}
B = {3, 5, 7, 9}
B except A = {3, 5}

These three set operators cover many situations, but how might you describe the case shown in Figure 4-4?

Figure 4-4. Mystery operation

The operation illustrated in Figure 4-4 cannot be achieved using one of the three set operators described previously. It can be achieved, however, by combining multiple set operators. Here are two different ways to generate the shaded area shown in Figure 4-4:

(A union B) except (A intersect B)
  -- or --
(A except B) union (B except A)

The next sections in this chapter will show you how to put these set operators to work.

The union Operator

The union operator allows you to combine two data sets. Here’s a simple example:

PUBLIC>select 1 as numeric_col, 'ABC' as string_col
       union
       select 2 as numeric_col, 'XYZ' as string_col;
+-------------+------------+
| NUMERIC_COL | STRING_COL |
|-------------+------------|
|           1 | ABC        |
|           2 | XYZ        |
+-------------+------------+

This is about as simple as it gets: two single-row result sets, without any overlapping values, combining to make a two-row result set. This type of statement is known as a compound query because it contains two independent queries separated by a set operator. Before moving to more advanced examples, here are some restrictions related to the two sets being combined:

  • Both sets must contain the same number of columns.
  • The data types of the columns must match.

Here’s how Snowflake responds if the two sets have a different column count:

PUBLIC>select 1 as numeric_col, 'ABC' as string_col
       union
       select 2 as numeric_col, 'XYZ' as string_col, 
         99 as extra_col;
001789 (42601): SQL compilation error:
invalid number of result columns for set operator ...

In this case, you will see a very specific error message from Snowflake, but here’s the error thrown if the column types don’t match:

PUBLIC>select 1 as numeric_col, 'ABC' as string_col
       union
       select 'XYZ' as numeric_col, 2 as string_col;
100038 (22018): Numeric value 'ABC' is not recognized

This error message is more general in nature, but it tells you that given the data type used in the first column of the first query (the number 1), the value 'ABC' does not match.

As mentioned earlier, the union operator removes duplicates when constructing the result set, meaning that any overlaps are represented only once. Here are the same two sets used earlier ({1, 2, 4, 7, 9} and {3, 5, 7, 9}) built using the values subclause and then combined using union:

PUBLIC>select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val)
       union
       select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val);
+-------------+
| INTEGER_VAL |
|-------------|
|           1 |
|           2 |
|           4 |
|           7 |
|           9 |
|           3 |
|           5 |
+-------------+

As you can see, both sets include 7 and 9, but these values only appear once in the result set because the union operator sorts the values and removes duplicates.

While this is the default behavior, there are cases when you don’t want the duplicates to be removed, in which case you can use union all:

PUBLIC>select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val)
       union all
       select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val);
+-------------+
| INTEGER_VAL |
|-------------|
|           1 |
|           2 |
|           4 |
|           7 |
|           9 |
|           3 |
|           5 |
|           7 |
|           9 |
+-------------+

The result set now includes nine rows instead of seven, with the duplicate 7s and 9s making up the extra two rows.

The sample database includes data from 1992 through 1998; here’s a query that uses union to find all customers who placed orders greater than $350,000 in 1992 or 1993:

PUBLIC>select distinct o_custkey
       from orders
       where o_totalprice > 350000
         and date_part(year, o_orderdate) = 1992
       union
       select distinct o_custkey
       from orders
       where o_totalprice > 350000
         and date_part(year, o_orderdate) = 1993;
+-----------+
| O_CUSTKEY |
|-----------|
|    114856 |
|    116683 |
|      4334 |
|     61790 |
|     24275 |
|     20665 |
|    130745 |
|     70054 |
|    122047 |
|    123934 |
... <847 rows omitted>
|    102304 |
|    106813 |
|    109519 |
|    120437 |
|    131131 |
|    130987 |
|    108745 |
|      5330 |
|     95137 |
|     59665 |
+-----------+

This query could have been written with a single filter condition and without the union operator, but this form of the query will be useful to compare the results of intersect and except in the next two sections.

The intersect Operator

As described earlier in the chapter (see “Set Theory Primer”), the intersect operator is used to find the intersection between two data sets. Here’s an example using the two numeric data sets from earlier:

PUBLIC>select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val)
       intersect
       select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val);
+-------------+
| INTEGER_VAL |
|-------------|
|           7 |
|           9 |
+-------------+

The overlap of sets A and B are the values 7 and 9; all other values are unique to each set.

In the previous section, the query against the Orders table using union returned 867 rows. Let’s see what happens when union is changed to intersect for the same query:

PUBLIC>select distinct o_custkey
       from orders
       where o_totalprice > 350000
         and date_part(year, o_orderdate) = 1992
       intersect
       select distinct o_custkey
       from orders
       where o_totalprice > 350000
         and date_part(year, o_orderdate) = 1993;
+-----------+ 
| O_CUSTKEY |
|-----------|
|    100510 |
+-----------+

While there are 867 different customers who placed orders over $350,000 in 1992 or 1993, only one of them placed such orders in both years.

The except Operator

The third and last of the set operators is except, which is used to return rows from set A without any overlap with set B. Here’s an example using the two numeric data sets from earlier:

PUBLIC>select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val)
       except
       select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val);
+-------------+ 
| INTEGER_VAL |
|-------------|
|           1 |
|           2 |
|           4 |
+-------------+

This operation yields the set {1, 2, 4}, which are the three values found in set A but not in set B. Switching the order of the two sets yields a different result:

PUBLIC>select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val)
       except
       select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val);
+-------------+ 
| INTEGER_VAL |
|-------------|
|           3 |
|           5 |
+-------------+

Unlike union and intersect, which will yield the same results regardless of which set is above or below the set operator, keep in mind that A except B will usually yield different results than B except A.

Returning to the Orders query, here are the results when using except:

PUBLIC>select distinct o_custkey
       from orders
       where o_totalprice > 350000
         and date_part(year, o_orderdate) = 1992
       except
       select distinct o_custkey
       from orders
       where o_totalprice > 350000
         and date_part(year, o_orderdate) = 1993;
+-----------+
| O_CUSTKEY |
|-----------|
|    134878 |
|    125183 |
|     44240 |
|     79138 |
|     71119 |
... <444 rows omitted>
|     74929 |
|    149240 |
|     23587 |
|     79999 |
|     46490 |
+-----------+

This query returns 454 customers and when the order of the sets is reversed (orders from 1993 first), the query returns 412 rows.

Set Operation Rules

The following sections outline some rules to keep in mind when working with set operators.

Sorting Compound Query Results

If you want your results to be sorted when using set operators, you need to abide by the following rules:

  • There may only be one order by clause, and it must be at the end of the statement.
  • When specifying column names in the order by clause, you must use the column names/aliases from the first query.

Regarding the second point, the individual queries used as part of a compound statement may retrieve data from different tables having different column names, so the column names used in the first query determine what can be specified in the order by clause. Here’s an example:

PUBLIC>select distinct o_orderdate from orders
       intersect
       select distinct l_shipdate from lineitem
       order by o_orderdate;
+-------------+ 
| O_ORDERDATE |
|-------------|
| 1992-01-03  |
| 1992-01-04  |
| 1992-01-05  |
| 1992-01-06  |
| 1992-01-07  |
... <2,394 rows omitted>
| 1998-07-29  |
| 1998-07-30  |
| 1998-07-31  |
| 1998-08-01  |
| 1998-08-02  |
+-------------+

In this example, the column used for sorting must be by o_orderdate rather than l_shipdate; here’s what happens if you specify l_shipdate as the sort column:

PUBLIC>select distinct o_orderdate from orders
       intersect
       select distinct l_shipdate from lineitem
       order by l_shipdate;
000904 (42000): SQL compilation error: error line 4 at position 9
invalid identifier 'L_SHIPDATE'

To avoid confusion in these cases, you can specify matching column aliases for both queries and use the alias in your order by clause.

Set Operation Precedence

If your compound query contains more than two queries using different set operators, you need to consider the order in which to place the queries in order to achieve the desired results. Figure 4-5 shows the “mystery operation” introduced earlier as Figure 4-4.

Figure 4-5. Mystery operation

Let’s see how this could be accomplished using the two simple numeric sets used throughout this chapter:

PUBLIC>select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val);
+-------------+ 
| INTEGER_VAL |
|-------------|
|           1 |
|           2 |
|           4 |
|           7 |
|           9 |
+-------------+

PUBLIC>select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val);
+-------------+
| INTEGER_VAL |
|-------------|
|           3 |
|           5 |
|           7 |
|           9 |
+-------------+

As mentioned earlier in the chapter, there are multiple ways to generate the shaded area in Figure 4-5, which in this case is the set {1, 2, 3, 4, 5}, but let’s choose the second one:

(A except B) union (B except A)

Here’s how that might be constructed for the two numeric sets:

PUBLIC>select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val)
       except
       select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val)
       union
       select integer_val
       from (values (3), (5), (7), (9))
       as set_b (integer_val)
       except
       select integer_val
       from (values (1), (2), (4), (7), (9))
       as set_a (integer_val);
+-------------+ 
| INTEGER_VAL |
|-------------|
|           3 |
|           5 |
+-------------+

All the pieces seem to be in place and in the right order, but the results don’t seem to be correct. The issue here is that the server is applying these operations from the top down, whereas these operations need to be grouped together. Specifically, the two except operations need to be done independently, and the two resulting sets can then be joined together using union. To do so, you must use parentheses:

PUBLIC>(select integer_val
        from (values (1), (2), (4), (7), (9))
        as set_a (integer_val)
        except
        select integer_val
        from (values (3), (5), (7), (9))
        as set_b (integer_val)
        )
       union
       (select integer_val
        from (values (3), (5), (7), (9))
        as set_b (integer_val)
        except
        select integer_val
        from (values (1), (2), (4), (7), (9))
        as set_a (integer_val)
        );
+-------------+ 
| INTEGER_VAL |
|-------------|
|           1 |
|           2 |
|           4 |
|           5 |
|           3 |
+-------------+

When building compound queries with three or more set operators, make sure you consider how the different operations may need to be grouped in order to achieve the desired results.

Wrap-Up

In this chapter, you learned how the set operators unionintersect, and except can be used to identify overlaps between multiple data sets. You were also introduced to strategies for combining these operations in order to identify complex relationships between data sets. While you may not see these set operators used in many examples in later chapters, keep these powerful tools in mind when performing data analysis or crafting reports.

Test Your Knowledge

The following exercises are designed to test your understanding of set operators. Please see “Chapter 4” in Appendix B for solutions.

Exercise 4-1

Given the following two sets:

A = {3, 5, 7, 9}
	
B = {4, 5, 6, 7, 8}

Which sets are generated by each of the following operations?

  • A union B
  • A union all B
  • A intersect B
  • A except B
  • B except A

Exercise 4-2

Write a compound query that returns the names of all regions (Region.r_name) and nations (Nation.n_name) that start with the letter A.

Exercise 4-3

Modify the query from Exercise 4-2 to sort by name (the default sort is fine).

Exercise 4-4

Given the following three sets:

A = {3, 5, 7, 9}
	
B = {4, 5, 6, 7, 8}

C = {8, 9, 10}

What set is returned by the following operation?

(A except B) intersect C

Get Learning Snowflake SQL and Scripting now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.