Chapter 1. A SELECT Review
This chapter is a very quick review of the SQL elements used throughout this book. In particular, we will cover the following topics:

- Simple `SELECT` statements (see the next section)
- Common table expressions (see “Common Table Expressions”)
- In `CASE` of emergency (see “In CASE of Emergency”)
These will all be used in examples in the remainder of the book, so let’s take some time and get used to them here.
Simple SELECT Statements
In the beginning was the word, and the word was `SELECT`. To get anything out of a relational database table, you need `SELECT`. Here it is in its simplest form (to save space, only the first 10 rows are shown in the result set, and we’re counting DC as a state):
```sql
SELECT *
FROM crm.CustomerCountByState;
```
| State | Total |
|---|---|
| AK | 6 |
| AL | 0 |
| AR | 1 |
| AZ | 9 |
| CA | 72 |
| CO | 9 |
| CT | 5 |
| DC | 1 |
| DE | 0 |
| FL | 28 |
That gives us a “raw” dump of all the columns in the table (the `*`), in whatever order the rows happen to be stored in the database table (database order).
Now we will switch to a bit “wider” dataset with more columns. Let’s take the top 10 by whatever states happen to sort first in the customer dataset. We’re also going to be choosy about which columns we want and not just grab them all:
```sql
SELECT TOP 10 /* Because we only want 10 */
    LastName,
    FirstName,
    City,
    State
FROM crm.NormalizedCustomer
ORDER BY State; -- This is the filter for the TOP 10
```
| LastName | FirstName | City | State |
|---|---|---|---|
| Campain | Roxane | Fairbanks | AK |
| Ferencz | Erick | Fairbanks | AK |
| Giguere | Wilda | Anchorage | AK |
| Kitty | Gail | Anchorage | AK |
| Paprocki | Lenna | Anchorage | AK |
| Weight | Penney | Anchorage | AK |
| Deleo | Carin | Little Rock | AR |
| Borgman | Keneth | Phoenix | AZ |
| Eschberger | Christiane | Phoenix | AZ |
| Kannady | Regenia | Scottsdale | AZ |
Let’s stop and take a look at all of that. First, look at the SQL `SELECT` itself. I always break up my SQL statements in this manner: it makes the different clauses easier to understand, and I can quickly scan the “pattern” of this style. Then we see a table with the 10 rows of 4 columns we asked for.
Let’s keep going. How many customers do we have in each state? The following code is a common pattern—I use it all the time while doing exploratory data analysis (EDA). This example may not count true unique customers if there are duplicates in our data (often likely, even in a production CRM system), but this will be close enough for our example:
```sql
SELECT TOP 10
    State,
    COUNT(*) Total /* "Total" is the name assigned to the column */
FROM crm.NormalizedCustomer
GROUP BY State      /* Required for aggregate functions */
ORDER BY 2 DESC, 1; -- States with most customers first
```
| State | Total |
|---|---|
| CA | 72 |
| NJ | 52 |
| NY | 46 |
| TX | 32 |
| PA | 29 |
| FL | 28 |
| OH | 22 |
| MD | 17 |
| IL | 15 |
| MI | 14 |
Note
In most examples, I am going to use something like `SELECT TOP 10` to limit the resulting output, as shown here.
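`TOP` is SQL Server syntax; if you are following along in another database, the common equivalents are sketched here (availability varies by product and version):

```sql
-- SQL Server / T-SQL:
SELECT TOP 10 State FROM crm.NormalizedCustomer;

-- PostgreSQL, MySQL, SQLite:
SELECT State FROM crm.NormalizedCustomer LIMIT 10;

-- ANSI SQL:2008 (also Oracle 12c and later):
SELECT State FROM crm.NormalizedCustomer FETCH FIRST 10 ROWS ONLY;
```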
In this type of distribution query, we are looking at one (or more) columns and doing a `COUNT(*)` of the number of rows in each group (we will cover `COUNT` in Chapter 2 on useful SQL functions). Note we can give the output of the `COUNT` a column name in the results, in this case Total. The results show us a nice distribution with a “long tail” of 1s in the Total column. The `GROUP BY` is required and needs to contain all columns that aren’t part of the aggregation (more on that in a moment).
The `ORDER BY` is interesting. Here I am sorting the most common states first (`2 DESC` specifies the second column in the result set, in descending order) and then by the first column in the result set (the `1`). Normally, using column ordinals in this clause is frowned on in production code, because if the column list changes, you often have to go rework the `ORDER BY`: remove a column and suddenly your ordinal may be out of range, causing an error. But we aren’t doing production code in this book, and I often use this shorthand form to save typing when using the pattern:
```sql
SELECT Foo,
       COUNT(*) Total
FROM Bar
GROUP BY Foo
ORDER BY 2 DESC, 1;
```
Many dialects support using ordinals in the `GROUP BY` as well. Again, I don’t think using them is a good idea for anything that might be pushed into production.
If I didn’t want to use ordinals, I would do the following. This is the same query, but I am not using ordinals, and I am giving meaningful names to the result columns:
```sql
SELECT TOP 10
    State    [Top 10 States],  /* You can put spaces in column names to make */
    COUNT(*) [Total Customers] /* the results more readable (and if you do,  */
                               /* surround the name with square brackets).   */
FROM crm.NormalizedCustomer
GROUP BY State
ORDER BY COUNT(*) DESC, State; -- Note using the COUNT function again
```
| Top 10 States | Total Customers |
|---|---|
| CA | 72 |
| NJ | 52 |
| NY | 46 |
| TX | 32 |
| PA | 29 |
| FL | 28 |
| OH | 22 |
| MD | 17 |
| IL | 15 |
| MI | 14 |
Common Table Expressions
That second `COUNT(*)` brings up another point: we can use functions in other places, such as the `WHERE` or `ORDER BY` clause. What if, instead of the top 10, we simply want all states with 10 or more customers? We wouldn’t know that answer in advance, so we couldn’t use it in a `SELECT TOP 10` (or `LIMIT 10`) clause. Here is one way to do it, using common table expressions (CTEs), aka the `WITH` statement:
```sql
/*
Think of a CTE as a temporary "view,"
although it isn't as performant. In
this case the CTE name is CustomerCountByState.
*/
WITH CustomerCountByState AS
(
    SELECT State,
           COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
)
/*
Now we can use our CTE.
*/
SELECT State [States w/10 or More Customers],
       Total [# in State]
FROM CustomerCountByState
WHERE Total > 9 /* Or >= 10, your choice */
ORDER BY Total DESC, State;
```
| States w/10 or More Customers | # in State |
|---|---|
| CA | 72 |
| NJ | 52 |
| NY | 46 |
| TX | 32 |
| PA | 29 |
| FL | 28 |
| OH | 22 |
| MD | 17 |
| IL | 15 |
| MI | 14 |
| MA | 12 |
| WI | 11 |
| TN | 10 |
CTEs aren’t efficient. If you use one in multiple places in a query, it acts more like a macro expansion than a true view. But I like them because they allow me to build up queries in an orderly manner that others can understand. There are many ways to solve the preceding problem in SQL. A subquery would get us to the same place but in my opinion isn’t as readable. Basically, you can think of a CTE as lifting the subquery in the following `SELECT` out, predefining it, and giving it a name. I have aliased the subquery CustomerCountByState so that the similarity to the preceding CTE is explicit:
```sql
SELECT State [States w/10 or More Customers],
       Total [# in State]
FROM
(
    SELECT State,
           COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
) CustomerCountByState /* Gotta give the subquery an alias.
                          It can be anything. I often use "A"
                          for simple queries like this, but am
                          naming it the same as the CTE to be
                          explicit. */
WHERE Total > 9
ORDER BY Total DESC, State;
```
| States w/10 or More Customers | # in State |
|---|---|
| CA | 72 |
| NJ | 52 |
| NY | 46 |
| TX | 32 |
| PA | 29 |
| FL | 28 |
| OH | 22 |
| MD | 17 |
| IL | 15 |
| MI | 14 |
| MA | 12 |
| WI | 11 |
| TN | 10 |
Note
In many cases I will use CTEs during EDA and then, as I get closer to rolling something into production, change them into proper views with the same names. Hence:
```sql
-- This:
WITH Foo AS
(
    SELECT *
    FROM Bar
    WHERE Status = 'Foo'
)
SELECT *
FROM Foo;
```
```sql
-- Becomes this:
CREATE VIEW Foo AS
(
    SELECT *
    FROM Bar
    WHERE Status = 'Foo'
)
```
The query can remain unchanged (presuming it is running in the default schema), and the view can be used in other queries as well:
```sql
SELECT *
FROM Foo;
```
The interesting thing about CTEs is I often chain them in a row, each building and filtering on the one before it:
```sql
WITH CustomerCountByState AS
(
    SELECT State,
           COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
),
CitiesInTopStates AS
(
    SELECT City,
           State,
           COUNT(*) Total
    FROM crm.NormalizedCustomer
    WHERE
        /*
        States with more than 4 (or >= 5) customers.
        */
        State IN
        (
            SELECT State
            FROM CustomerCountByState
            WHERE Total > 4
        )
    GROUP BY City, State
)
SELECT City,
       State,
       Total
FROM CitiesInTopStates
/*
Cities with 5 or more (or > 4) customers.
*/
WHERE Total >= 5
ORDER BY Total DESC, City, State;
```
| City | State | Total |
|---|---|---|
| New York | NY | 14 |
| Philadelphia | PA | 8 |
| Chicago | IL | 7 |
| Miami | FL | 6 |
| Raton | NM | 6 |
| Baltimore | MD | 5 |
| Gardena | CA | 5 |
| Milwaukee | WI | 5 |
| Orlando | FL | 5 |
| Phoenix | AZ | 5 |
| San Francisco | CA | 5 |
Note in the following example that a CTE can be a nice way to get away from using ordinals or SQL functions in your `ORDER BY` clause, but I wouldn’t recommend using one only for that:
```sql
WITH TopTenStates AS
(
    SELECT State,
           COUNT(*) Total
    FROM crm.NormalizedCustomer
    GROUP BY State
)
SELECT TOP 10
    State [Top 10 States],
    Total [Total Customers]
FROM TopTenStates
ORDER BY Total DESC, State;
```
| Top 10 States | Total Customers |
|---|---|
| CA | 72 |
| NJ | 52 |
| NY | 46 |
| TX | 32 |
| PA | 29 |
| FL | 28 |
| OH | 22 |
| MD | 17 |
| IL | 15 |
| MI | 14 |
In CASE of Emergency
I use the `CASE` expression a lot. It is basically an expression form of a switch statement in a C-syntax programming language. Remember your ancient history, like this example:
```c
/* Ye olde C-style switch statement - brought to you by K&R. */
switch (foo)
{
    case 0:
        bar = "It was zero.";
        break;
    case 1:
        bar = "It was one.";
        break;
    default:
        bar = "Dunno what it was.";
        break;
}
```
You can achieve similar results in SQL as follows:
```sql
SELECT CASE foo
           WHEN 0 THEN 'It was zero.'
           WHEN 1 THEN 'It was one.'
           ELSE 'Dunno what it was'
       END bar
FROM xyzzy;
```
In this book, you will see me use all three of the styles shown in the following code block, for varying reasons, including mood and the space available to show an example (including whether I am emphasizing the `CASE` or something else in the query).
Consider this simple table:
| City | USPSCity |
|---|---|
| O’Fallon | OFallon |
How do we check if the two cities are “the same”? The following code shows how. I expand upon this technique later in the book; this example instead shows three styles of `CASE` expressions, all checking for the same thing:
```sql
/*
We will cover this example in depth in Chapter 3.
*/
WITH NormalizeCities AS
(
    SELECT City,
           USPSCity,
           /* The third TRANSLATE argument is nine spaces,
              one per punctuation character being replaced. */
           REPLACE(TRANSLATE(City,     '.,-''()[]`', '         '),
                   ' ', '') NormCity,
           REPLACE(TRANSLATE(USPSCity, '.,-''()[]`', '         '),
                   ' ', '') NormUSPSCity
    FROM Foo
)
SELECT City,
       USPSCity [USPS City],
       /*
       Single-line CASE expression.
       */
       CASE City WHEN USPSCity THEN 'Match!' ELSE 'No Match!' END [Match?],
       NormCity [Norm. City],
       NormUSPSCity [Norm. USPS City],
       /*
       Multiline CASE with expression test after the CASE.
       */
       CASE NormCity
           WHEN NormUSPSCity THEN 'Match!'
           ELSE 'No Match!'
       END [Do Normalized Match?],
       /*
       Multiline CASE with expression test in the WHEN. Note "=".
       */
       CASE
           WHEN NormCity = NormUSPSCity THEN 'Match!'
           ELSE 'No Match!'
       END [Still?]
FROM NormalizeCities;
```
| City | USPS City | Match? | Norm. City | Norm. USPS City | Do Normalized Match? | Still? |
|---|---|---|---|---|---|---|
| O’Fallon | OFallon | No Match! | OFallon | OFallon | Match! | Match! |
Gee, that’s a lot! Let’s take it apart piece by piece. The `WITH` statement creating the NormalizeCities CTE is simply making the `CASE` expressions in the example `SELECT` query simpler (the `REPLACE` and `TRANSLATE` functions will be explained in Chapter 2).
The `SELECT` then works against the CTE’s output. We are simply looking at different ways to express the `CASE` expression, so that’s what I want to emphasize here. The four columns in the `SELECT` dumping out the various forms of the city are obvious, so we will look at the three `CASE` expressions in order. All are basically a simple “if…then…else” expression: if a match occurs, then return “Match!”, else “No Match!” For something as simple as this, see also the `IIF` SQL function in Chapter 2.
The first `CASE` is expressed all in a line:
```sql
CASE City WHEN USPSCity THEN 'Match!' ELSE 'No Match!' END [Match?]
```
Note
You can read it like this: “In the `CASE` of the value in City, `WHEN` it is equal to USPSCity, `THEN` return the value ‘Match!’, `ELSE` return the value ‘No Match!’, and name the result (the column in the result set) Match?”
The next one varies only in adding new lines and indentation to make the syntactic elements more obvious to the script reader (that’s you). It returns a column named Do Normalized Match?:
```sql
CASE NormCity
    WHEN NormUSPSCity THEN 'Match!'
    ELSE 'No Match!'
END [Do Normalized Match?],
```
The third is of more interest:
```sql
CASE
    WHEN NormCity = NormUSPSCity THEN 'Match!'
    ELSE 'No Match!'
END [Still?]
```
In this case (ahem) there is no value following the `CASE`. Instead, the `WHEN` has the test clause in it, followed by the `THEN`.
Note
Think of it like this: “In the following `CASE`s, `WHEN` the value of NormCity is equal to the value of NormUSPSCity, `THEN` return the value ‘Match!’, `ELSE` return the value ‘No Match!’, and name the result Still?”
This is often the preferred form, because it allows more complexity. See the next example.
As mentioned, the SQL function `IIF` does a similar job (as can `COALESCE` or `ISNULL`), so why go to all the complications of a `CASE` expression? For something like this, I wouldn’t. However, let’s change the city and see how we can use `CASE` to handle a much more complicated subject. We will use only the last style of `CASE` expression so we can look at each test line by line:
```sql
DECLARE @City       VARCHAR(50) = 'St Louis';    /* How we store it */
DECLARE @ImportCity VARCHAR(50) = 'Saint Louis'; /* How Marketing receives it */

SELECT @City City,
       @ImportCity [Imported City],
       CASE
           /*
           Really crude approach, but effective. Note trailing spaces
           in each match term to make sure and not find embedded "st"
           instances, for example.
           */
           WHEN REPLACE(@City, 'St ',   'Saint ')  = @ImportCity THEN 'Match!'
           WHEN REPLACE(@City, 'St. ',  'Saint ')  = @ImportCity THEN 'Match!'
           WHEN REPLACE(@City, 'Ste ',  'Sainte ') = @ImportCity THEN 'Match!'
           WHEN REPLACE(@City, 'Ste. ', 'Sainte ') = @ImportCity THEN 'Match!'
           /*
           And so on, and so forth...however many CASEs you got...
           */
           ELSE 'No Match!'
       END [Result?];
```
| City | Imported City | Result? |
|---|---|---|
| St Louis | Saint Louis | Match! |
Even that’s not too exciting, since it could be replaced by a single `IIF` function (explained in Chapter 2):
```sql
SELECT @City City,
       @ImportCity [Imported City],
       IIF(REPLACE(@City, 'St ',   'Saint ')  = @ImportCity OR
           REPLACE(@City, 'St. ',  'Saint ')  = @ImportCity OR
           REPLACE(@City, 'Ste ',  'Sainte ') = @ImportCity OR
           REPLACE(@City, 'Ste. ', 'Sainte ') = @ImportCity,
           'Match!',
           'No Match!') [Result?];
```
That then brings us back to why we should use `CASE` over something simpler like `IIF`. How about this? (The `LEN`, `LEFT`, and `RIGHT` functions will also be explained in Chapter 2.)
```sql
DECLARE @City       VARCHAR(50) = 'Ste Genevieve'; /* How we store it */
DECLARE @ImportCity VARCHAR(50) = 'St Genevieve';  /* How Marketing receives it */

SELECT @City City,
       @ImportCity [Imported City],
       CASE
           /*
           WHENs can actually check all variety of tests.
           */
           WHEN REPLACE(@City, 'St ',   'Saint ')  = @ImportCity THEN 'Match!'
           WHEN REPLACE(@City, 'St. ',  'Saint ')  = @ImportCity THEN 'Match!'
           WHEN REPLACE(@City, 'Ste ',  'Sainte ') = @ImportCity THEN 'Match!'
           WHEN REPLACE(@City, 'Ste. ', 'Sainte ') = @ImportCity THEN 'Match!'
           WHEN LEFT(@ImportCity, LEN('St ')) = 'St '
                AND RIGHT(@ImportCity, LEN('Genevieve')) = 'Genevieve'
               THEN 'Chauvinist!'
           WHEN LEFT(@ImportCity, LEN('St. ')) = 'St. '
                AND RIGHT(@ImportCity, LEN('Genevieve')) = 'Genevieve'
               THEN 'Chauvinist!'
           ELSE 'No Match!'
       END [Result?];
```
| City | Imported City | Result? |
|---|---|---|
| Ste Genevieve | St Genevieve | Chauvinist! |
Here, we were checking beyond simple tests of equality to look and see if there are any underlying data quality issues, which we politely call out (since the town and person are both of French origin and Genevieve is feminine, the proper title is “Sainte,” not “Saint”).
Joins
We’ve covered simple `SELECT` statements. Let’s look at joins next. Like belly buttons, there are basically two types: `INNER` and `OUTER`. There are also `LEFT` and `RIGHT` joins, but for our purposes I always use `LEFT`. Let’s look at an `INNER JOIN` first. For every row in the customer data, find the corresponding “friendly” state name in the postal reference table. You may not know it using the SQL `INNER JOIN` syntax; you may know it simply as a correlated `SELECT`:
```sql
SELECT TOP 10
    C.LastName  [Last Name],
    C.FirstName [First Name],
    C.City,
    A.StateOrProvince State
/*
Both tables are in the FROM
*/
FROM crm.NormalizedCustomer C,
     ref.PostalAbbreviations A
WHERE C.State = A.Abbreviation
ORDER BY C.LastName, C.FirstName;
```
| Last Name | First Name | City | State |
|---|---|---|---|
| NULL | NULL | Jefferson City | Missouri |
| NULL | NULL | Boulder | Colorado |
| NULL | NULL | Raton | New Mexico |
| Abdallah | Johnetta | Chapel Hill | North Carolina |
| Acey | Geoffrey | Palatine | Illinois |
| Acuff | Weldon | Arlington Heights | Illinois |
| Adkin | Barbra | Brooklyn | New York |
| Agramonte | Fausto | New York | New York |
| Ahle | Delmy | Providence | Rhode Island |
| Albares | Cammy | Laredo | Texas |
We can also use the `INNER JOIN...ON` syntax to be more explicit about our intentions:
```sql
SELECT TOP 10
    C.LastName  [Last Name],
    C.FirstName [First Name],
    C.City,
    A.StateOrProvince State
FROM crm.NormalizedCustomer C
INNER JOIN ref.PostalAbbreviations A
    ON C.State = A.Abbreviation
ORDER BY C.LastName, C.FirstName;
```
| Last Name | First Name | City | State |
|---|---|---|---|
| NULL | NULL | Jefferson City | Missouri |
| NULL | NULL | Boulder | Colorado |
| NULL | NULL | Raton | New Mexico |
| Abdallah | Johnetta | Chapel Hill | North Carolina |
| Acey | Geoffrey | Palatine | Illinois |
| Acuff | Weldon | Arlington Heights | Illinois |
| Adkin | Barbra | Brooklyn | New York |
| Agramonte | Fausto | New York | New York |
| Ahle | Delmy | Providence | Rhode Island |
| Albares | Cammy | Laredo | Texas |
Note the two result sets are identical. I will tend to use explicit `JOIN` syntax throughout this book.
The thing to remember with inner joins is that if there isn’t a match between the two tables, the row doesn’t get sent to the result set. `INNER` is like “intersection,” if you remember sets from math. For example, our customer data contains only United States states plus the District of Columbia. If there were a customer from the Canadian province of New Brunswick, that customer’s data would not be included in the preceding results.
One thing that’s new in both examples is we’ve given each table an alias: C for NormalizedCustomer and A for PostalAbbreviations. This helps distinguish which columns we are talking about when both tables share a column name: is it the column in the customer data or the one in the reference data? When no columns share a name between the two tables, the aliases aren’t needed, but I find it good practice to always use them on joins, again to be explicit about my intentions. Notice we also gave more user-friendly column names in the result set, with spaces in Last Name and First Name instead of just CamelCase, e.g., LastName.
Also, on a `JOIN` the `ON` is not a complete replacement for the `WHERE` clause. In fact, trying to make it so can lead to some surprising results.
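To sketch the kind of surprise I mean (the sales.Orders table and CustomerId column here are hypothetical, not part of our sample database): on an outer join, a filter in the `ON` clause only limits which right-side rows count as matches, while the same filter in the `WHERE` clause removes rows from the final result.

```sql
-- Filter in the ON clause: every customer still appears in the
-- result; non-California customers simply get NULL order columns.
SELECT C.LastName, O.OrderId
FROM crm.NormalizedCustomer C
LEFT OUTER JOIN sales.Orders O /* hypothetical table */
    ON C.CustomerId = O.CustomerId
    AND C.State = 'CA';

-- Filter in the WHERE clause: non-California rows are removed
-- after the join, so only California customers remain.
SELECT C.LastName, O.OrderId
FROM crm.NormalizedCustomer C
LEFT OUTER JOIN sales.Orders O
    ON C.CustomerId = O.CustomerId
WHERE C.State = 'CA';
```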
A Diversion into NULL Values
Check this out: we have `NULL` values in some name fields in our CRM data:
```sql
SELECT LastName,
       FirstName
FROM crm.NormalizedCustomer
WHERE LastName IS NULL;
```
| LastName | FirstName |
|---|---|
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
We also have them in our import dataset:
```sql
SELECT LastName,
       FirstName
FROM staging.Customers
ORDER BY LastName, FirstName;
```
| LastName | FirstName |
|---|---|
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| Abdallah | Johnetta |
| Acey | Geoffrey |
| Acuff | Weldon |
| Adkin | Barbra |
| Agramonte | Fausto |
| Ahle | Delmy |
| Albares | Cammy |
So why don’t they get found in a `JOIN`?
```sql
SELECT SC.LastName,
       SC.FirstName
FROM staging.Customers SC
INNER JOIN crm.NormalizedCustomer NC
    ON  SC.LastName  = NC.LastName
    AND SC.FirstName = NC.FirstName
ORDER BY SC.LastName, SC.FirstName;
```
| LastName | FirstName |
|---|---|
| Abdallah | Johnetta |
| Acey | Geoffrey |
| Acuff | Weldon |
| Adkin | Barbra |
| Agramonte | Fausto |
| Ahle | Delmy |
| Albares | Cammy |
“Wait!” you exclaim. “I remember something about `ANSI_NULLS`.” Let’s try that:
```sql
SET ANSI_NULLS OFF;

SELECT SC.LastName,
       SC.FirstName
FROM staging.Customers SC
INNER JOIN crm.NormalizedCustomer NC
    ON  SC.LastName  = NC.LastName
    AND SC.FirstName = NC.FirstName
ORDER BY SC.LastName, SC.FirstName;
```
| LastName | FirstName |
|---|---|
| Abdallah | Johnetta |
| Acey | Geoffrey |
| Acuff | Weldon |
| Adkin | Barbra |
| Agramonte | Fausto |
| Ahle | Delmy |
| Albares | Cammy |
Hmmm…still missing those `NULL` rows. Why? Because it is working as designed; as the documentation says, “`SET ANSI_NULLS ON` affects a comparison only if one of the operands of the comparison is either a variable that is `NULL` or a literal `NULL`. If both sides of the comparison are columns or compound expressions, the setting does not affect the comparison.” This is where you have to use a `WHERE` clause instead, and something like `EXISTS`:
```sql
SELECT SC.LastName,
       SC.FirstName
FROM staging.Customers SC
WHERE EXISTS
    (
        SELECT 1
        FROM crm.NormalizedCustomer NC
        /* Treat NULL as equal to NULL, which "=" won't. */
        WHERE (NC.LastName = SC.LastName
               OR (NC.LastName IS NULL AND SC.LastName IS NULL))
          AND (NC.FirstName = SC.FirstName
               OR (NC.FirstName IS NULL AND SC.FirstName IS NULL))
    )
ORDER BY SC.LastName, SC.FirstName;
```
| LastName | FirstName |
|---|---|
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| Abdallah | Johnetta |
| Acey | Geoffrey |
| Acuff | Weldon |
| Adkin | Barbra |
| Agramonte | Fausto |
| Ahle | Delmy |
| Albares | Cammy |
So, be aware when joining tables whether any of the join columns have `NULL`s in them, or you will have to change your approach.
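Two common ways to change your approach, sketched against the tables above (the `'~'` sentinel is an assumption; pick a value that can never occur in real names):

```sql
-- Option 1: map NULL to a sentinel on both sides before comparing.
SELECT SC.LastName, SC.FirstName
FROM staging.Customers SC
INNER JOIN crm.NormalizedCustomer NC
    ON  ISNULL(SC.LastName, '~')  = ISNULL(NC.LastName, '~')
    AND ISNULL(SC.FirstName, '~') = ISNULL(NC.FirstName, '~');

-- Option 2: set operators treat NULLs as "not distinct," so
-- INTERSECT will pair up the NULL-named rows, too.
SELECT LastName, FirstName FROM staging.Customers
INTERSECT
SELECT LastName, FirstName FROM crm.NormalizedCustomer;
```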
OUTER JOINs
Now for an `OUTER JOIN`. This is not a “real” example but is useful for our limited data so far. What states do we not have customers in? A left outer join will have `NULL` for any values in the “right-side” table (in the following, dbo.PotentialMatches) that don’t have a match from the “left-side” table:
```sql
SELECT TOP 10
    A.StateOrProvince State,
    P.last_name    [Last Name],
    P.first_name   [First Name],
    P.company_name Company,
    P.City
FROM ref.PostalAbbreviations A
/*
Attempt to join against our import data.
*/
LEFT OUTER JOIN dbo.PotentialMatches P
    ON A.Abbreviation = P.State
ORDER BY A.StateOrProvince;
```
| State | Last Name | First Name | Company | City |
|---|---|---|---|---|
| Alabama | NULL | NULL | NULL | NULL |
| Alaska | NULL | NULL | NULL | NULL |
| Arizona | NULL | NULL | NULL | NULL |
| Arkansas | NULL | NULL | NULL | NULL |
| California | Hamilton | Charlene | NULL | Santa Rosa |
| California | Rochin | Xuan | NULL | San Mateo |
| California | Waycott | Kanisha | NULL | L.A. |
| Colorado | NULL | NULL | Foo, Inc. | Boulder |
| Connecticut | Layous | Ma | NULL | North Haven |
| Delaware | NULL | NULL | NULL | NULL |
Those first four rows with `NULL` values in the import data columns indicate there were four states found in the ref.PostalAbbreviations table (left) that had no matches within the import data (right).
Now we can add a `WHERE` clause using this fact:
```sql
SELECT TOP 10
    A.StateOrProvince [1st 10 States Not In Import File]
FROM ref.PostalAbbreviations A
/*
Join against our import data.
*/
LEFT OUTER JOIN dbo.PotentialMatches P
    ON A.Abbreviation = P.State
WHERE P.State IS NULL /* Where no match in import data */
ORDER BY A.StateOrProvince;
```
| 1st 10 States Not In Import File |
|---|
| Alabama |
| Alaska |
| Arizona |
| Arkansas |
| Delaware |
| District of Columbia |
| Georgia |
| Idaho |
| Illinois |
| Iowa |
If the “left” and “right” seem confusing, think of the `SELECT` as all on one line, reading from left to right:
```sql
SELECT *
FROM lefttable l
LEFT OUTER JOIN righttable r
    ON l.somecol = r.somecol
```
In a `LEFT OUTER JOIN`, the columns from the right table will have `NULL`s in the result set if there wasn’t a match.
The preceding is a bit contrived and may not be useful, however. There could be lots of states without potential customers. Here’s a better question: What states do we not have current customers in? Where do we need to concentrate our sales efforts?
```sql
SELECT TOP 10
    A.StateOrProvince [State Without Customers]
FROM ref.PostalAbbreviations A
/*
Join against our customer data.
*/
LEFT OUTER JOIN crm.NormalizedCustomer C
    ON A.Abbreviation = C.State
WHERE C.State IS NULL /* Where no match in customer data */
ORDER BY A.StateOrProvince;
```
| State Without Customers |
|---|
| Alabama |
| Delaware |
| Vermont |
| West Virginia |
Of course, we could have gotten the same result with a sub-`SELECT`. Which is clearer? I will let you decide:
```sql
SELECT StateOrProvince [State Without Customers]
FROM ref.PostalAbbreviations
WHERE Abbreviation NOT IN
    (
        SELECT State
        FROM crm.NormalizedCustomer
    )
ORDER BY StateOrProvince;
```
| State Without Customers |
|---|
| Alabama |
| Delaware |
| Vermont |
| West Virginia |
Obviously the reverse is useful, too, and actually the more common case:
```sql
SELECT COUNT(DISTINCT A.StateOrProvince) [# States With Customers]
FROM ref.PostalAbbreviations A
LEFT OUTER JOIN crm.NormalizedCustomer C
    ON A.Abbreviation = C.State
WHERE C.State IS NOT NULL; -- At least 1 match in customer data
```
| # States With Customers |
|---|
| 47 |
Let’s go back to the import data. It wasn’t useful to ask what states weren’t in it; that could be many. It wasn’t useful to do a join to ask what states were in it; we can just extract that with a `SELECT DISTINCT` or a `GROUP BY State`. But import data is noisy. Are there any records in the import data that aren’t from a valid United States state or the District of Columbia? For once I will show a `RIGHT OUTER JOIN`, just to keep the tables in the `SELECT` in the same order:
```sql
SELECT P.State [Invalid State In Import File]
FROM ref.PostalAbbreviations A
/*
Join against our import data.
*/
RIGHT OUTER JOIN dbo.PotentialMatches P
    ON A.Abbreviation = P.State
WHERE A.Abbreviation IS NULL /* Where no match in left dataset */
ORDER BY P.State;
```
| Invalid State In Import File |
|---|
| NULL |
| NB |
We see the import file has at least one missing state (not unusual) and one for “NB”. What’s that?
```sql
SELECT P.country,
       P.state [Invalid State In Import File]
FROM ref.PostalAbbreviations A
/*
Join against our import data.
*/
RIGHT OUTER JOIN dbo.PotentialMatches P
    ON A.Abbreviation = P.state
WHERE A.Abbreviation IS NULL /* Where no match in left dataset */
ORDER BY P.state;
```
| Country | Invalid State In Import File |
|---|---|
| U.S. | NULL |
| Canada | NB |
Ah, “NB” is in Canada, so it is New Brunswick! One mystery solved, another opened. What are we going to do with it? In many cases you will have to ask “The Business” how to deal with this. For regulatory reasons you may not be able to deal with someone in Canada. Do you “drop the record on the floor” or send it somewhere so a “We’re sorry” communication can be sent? Or does it need to be routed to a subsidiary or parent company in that country? That is beyond the scope of the book but does show some of the issues you will have to deal with. But you know how to identify the person in any case:
```sql
SELECT last_name  [Last Name],
       first_name [First Name]
FROM dbo.PotentialMatches
WHERE state = 'NB';
```
| Last Name | First Name |
|---|---|
| Snedley | Clive |
Again, outer joins are useful when you know data on one “side” (I typically choose the right) could be missing or unmatched when joining with data from the other side, in this case the left. In other words, if you think of the canonical list of all states as being in the reference table on the “left,” there may or may not be customers from all those states in the CRM table on the right; hence, when joining, you can expect `NULL` values in the right-side columns of result rows wherever a left row has no match on the right.
Finding the Most Current Value
Consider the following “real-life” scenario.
A loan may move from an opportunity in a CRM system to a loan application in the loan origination system. Being different applications, these are different tables in different databases. Then perhaps the loan origination system moves the loan from the application phase to underwriting and the processing phase, which moves it to another table, too. Perhaps it then goes to a servicing table after the loan is signed by the customer and booked. That table may in fact be in another application database as well, as part of the banking “core” system. In all four of those tables across three applications will be a value for the total loan amount, which may naturally change along the way with the normal back-and-forth of getting a loan and then paying it off over time. So, the question becomes, “Regardless of the state of the loan—opportunity, application, processing, or servicing—what is the current loan amount?”
Behold the `COALESCE` expression and its beauty (see also Chapter 2), which returns the first non-`NULL` value in a list of values. Also notice how I naturally include the use of outer joins and their utility, too. Sneaky:
```sql
SELECT COALESCE(L.LoanAmount,
                P.LoanAmount,
                A.LoanRequestAmount,
                O.OpportunityAmount) LoanAmount
FROM crm.Opportunity O
LEFT OUTER JOIN loan.LoanApplication A
    ON O.CustomerId = A.CustomerId
LEFT OUTER JOIN loan.LoanProcessing P
    ON O.CustomerId = P.CustomerId
LEFT OUTER JOIN core.Loan L
    ON O.CustomerId = L.CustomerId
WHERE O.CustomerId = 12345;
```
First, we know the preceding is unreal, because there is no way three disparate systems would share the same customer IDs, let alone have the same column names for them. We will cover how those types of joins are done across applications that don’t know about each other in later chapters, especially Chapter 9. But looking beyond that, we start the `SELECT` with the Opportunity table, because that’s where any deal is going to start: with a loan officer opening an opportunity and filling in a few details. The `LEFT OUTER JOIN` clauses then bring in the rows from the other applications, if they exist.
Since the flow of the loan is from opportunity to application to processing to servicing, we see that in reverse order in the `COALESCE`. A dashboard or report will typically want the “most current” information, so we start by looking at the core system’s L.LoanAmount column. If it is not `NULL`, the loan has booked, and the core’s loan accounting system will have the currently calculated daily loan balance. If it is `NULL`, however, the other columns are examined, left to right, until a non-`NULL` value is found, with O.OpportunityAmount being the guaranteed stop-gap (and also the fuzziest value when you think about it, since it will almost certainly be subject to change over the life of the loan process).
If the previous `COALESCE` had been expressed as either a `CASE` expression or an `IIF` function, it would have been much more complicated and ugly!
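To see why, here is roughly what that `COALESCE` unrolls to as a `CASE` expression — a sketch for illustration, not code you would want to maintain:

```sql
/* COALESCE(L.LoanAmount, P.LoanAmount,
            A.LoanRequestAmount, O.OpportunityAmount)
   behaves like: */
CASE
    WHEN L.LoanAmount        IS NOT NULL THEN L.LoanAmount
    WHEN P.LoanAmount        IS NOT NULL THEN P.LoanAmount
    WHEN A.LoanRequestAmount IS NOT NULL THEN A.LoanRequestAmount
    ELSE O.OpportunityAmount
END
```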
Final Thoughts on SELECT
To do anything “in real life” that I discuss in this book, you will need at least `SELECT` access to the tables and views under investigation. In Microsoft SQL Server, this can be easily granted with the db_datareader role on the database, but since that is a database-level role, it may not be appropriate to your environment if you work in a highly secured industry. You will need to be granted at least the following or equivalent on every table you will be accessing:
```sql
GRANT SELECT ON Foo.Bar TO YourLogin;
```
YourLogin can obviously be a database role or group in which your ID is a member. Of course, if you created the database(s), you are db_owner and can do what you want.
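If you are granting to a role rather than an individual login, a minimal T-SQL sketch looks like this (the role and login names are made up):

```sql
CREATE ROLE DataReviewers;
GRANT SELECT ON Foo.Bar TO DataReviewers;
ALTER ROLE DataReviewers ADD MEMBER YourLogin; -- SQL Server 2012+
```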
I discussed the utility of CTEs (see “Common Table Expressions”). I use them a lot in my work but can’t emphasize enough that they are not a panacea. If a query is going to get run repeatedly, it is better to create a view for it and let the SQL optimizer do its job. CTEs behave basically like macros, that is, “expanding in place” where they are referenced, and can be very slow if you create complicated chains of them. However, sometimes on a consulting job all you have is `SELECT` granted and no Data Definition Language (DDL) access to `CREATE` a view. Then you have to do what you have to do, and CTEs are often “good enough.”
In case you missed it, I find the `CASE` expression flowing out of my fingertips almost every time I am doing matching or mapping of data. I find it much more readable than the `IIF` function, even if it is a bit more typing:
```sql
-- This?
SELECT IIF(Foo = 1, 'Foo', 'Unfoo') [Foo?]
FROM Xyzzy;

-- Or this?
SELECT CASE
           WHEN Foo = 1 THEN 'Foo'
           ELSE 'Unfoo'
       END [Foo?]
FROM Xyzzy;
```
In this case (I promise, that’s the last pun), they are equivalent, and the choice is a matter of personal taste. However, as so often happens, requirements change, and suddenly you may find yourself wanting to do the following:
```sql
-- No. Just no. This way lies (t)error!
SELECT IIF(Foo = 1, 'Foo',
           IIF(Bar = 1, 'Bar', 'Unbarrable')) [Foo or Bar?]
FROM Xyzzy;

-- But this approach adapts pretty easily.
SELECT CASE
           WHEN Foo = 1 THEN 'Foo'
           WHEN Bar = 1 THEN 'Bar'
           ELSE 'Unbarrable'
       END [Foo or Bar?]
FROM Xyzzy;
```
You can see that the `IIF` approach will get ugly pretty fast as the tests to be checked multiply.
We’ve now seen that the absence of data, such as the `NULL`s returned from `LEFT OUTER JOIN` queries, can be as interesting as its presence and that, in fact, missing data is the first level of “fuzzy.” Let’s move on to Chapter 2 and look at some useful SQL functions to add to our `SELECT` statements to start dealing with all that fuzzy data out there!