174 Cody’s Data Cleaning Techniques Using SAS, Second Edition
Identifying Subjects with "n" Observations Each
Using the grouping capability of PROC SQL and the COUNT function, you can list all patients
who do not have exactly "n" visits or observations in a data set, just as you did in Programs 5-11
and 5-12. Here is the program with an explanation following.
Program 8-8 Using SQL to List Patients Who Do Not Have Two Visits
title "Listing of patients who do not have two visits";
proc sql;
select Patno,
Visit
from clean.patients2
group by Patno
having count(Patno) ne 2;
quit;
By first grouping the observations by patient number, you can then use the COUNT function,
which returns the number of observations in a group. Here is the output from Program 8-8.
Listing of patients who do not have two visits
Patno Visit
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
002 01/01/1999
002 02/09/1999
002 01/10/1999
003 10/21/1998
006 11/11/1998
Checking for an ID in Each of Two Files
Do you think PROC SQL can check if each patient number is in two files? Why else is there a
section heading with that task listed? Of course you can! Now, on to the problem.
The equivalent of a DATA step merge is called a JOIN in SQL terms. Normally, a JOIN lists only
those observations that have a matching value for the variables in each of the files. If you want all
observations from both files, regardless if they have a corresponding observation in the other file,
you perform a FULL JOIN (this is equivalent to a MERGE where no IN= variables are used). So,
if you perform a FULL JOIN between two data sets and an ID value is not in both data sets, one
Chapter 8 Some PROC SQL Solutions to Data Cleaning 175
of the observations will have a missing value for the ID variable. Let's use the same data sets,
ONE and TWO, that were used in Chapter 6. For convenience, the code to produce these data sets
is shown in Program 8-9.
Program 8-9 Creating Two Data Sets for Testing Purposes
data one;
input Patno X Y;
datalines;
1 69 79
2 56 .
3 66 99
5 98 87
12 13 14
;
data two;
input Patno Z;
datalines;
1 56
3 67
4 88
5 98
13 99
;
Here is the PROC SQL solution for finding ID's that are missing from one of the files:
Program 8-10 Using SQL to Look for ID's That Are Not in Each of Two Files
title "Patient numbers not in both files";
proc sql;
select One.patno as ID_one,
Two.patno as ID_two
from one full join two
on One.patno eq Two.patno
where One.patno is missing or Two.patno is missing;
quit;
Because the variable name Patno is used in both data sets, you can distinguish between them by
adding either ONE. or TWO. before the variable name, depending on whether you are referring to
the patient number from data set ONE or data set TWO. Also, to make it easier to keep track of

Get Cody's Data Cleaning Techniques Using SAS, Second Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.