18 Cody’s Data Cleaning Techniques Using SAS, Second Edition
if put(Gender,$gender.) = 'Miscoded' then put Patno= Gender=;
if put(AE,$ae.) = 'Miscoded' then put Patno= AE=;
The "heart" of this program is the PUT function. To review, the PUT function is similar to the
INPUT function. It takes the following form:
character_variable = put(variable, format)
where character_variable is a character variable that contains the value of the variable listed as
the first argument to the function, formatted by the format listed as the second argument to the
function. The result of a PUT function is always a character variable, and the function is
frequently used to perform numeric-to-character conversions. In Program 1-8, the first argument
of the PUT function is a character variable you want to test and the second argument is the
corresponding character format. The result of the PUT function for any invalid data values would
be the value 'Miscoded'.
Here is the output from Program 1-8.
Listing of invalid patient numbers and data values
Patno=003 gender=X
Patno=004 AE=A
Patno=010 gender=f
Patno=013 gender=2
Patno=023 gender=f
Using Informats to Remove Invalid Values
PROC FORMAT is also used to create informats. Remember that formats are used to control how
variables look in output or how they are classified by such procedures as PROC FREQ. Informats
modify the value of variables as they are read from the raw data, or they can be used with an
INPUT function to create new variables in the DATA step. User-defined informats are created in
much the same way as user-defined formats. Instead of a VALUE statement that creates formats,
an INVALUE statement is used to create informats. The only difference between the two is that
informat names can only be 31 characters in length. (Note: For those curious readers, the reason
is that informats and formats are both stored in the same catalog and an "@" is placed before
informats to distinguish them from formats.) The following is a program that changes invalid
values for GENDER and AE to missing values by using a user-defined informat.
Chapter 1 Checking Values of Character Variables 19
Program 1-9 Using a User-Defined Informat to Set Invalid Data Values to Missing
| Purpose: To create a SAS data set called PATIENTS2 |
| and set any invalid values for Gender and AE to |
| missing, using a user-defined informat |
libname clean "c:\books\clean";
proc format;
invalue $gen 'F','M' = _same_
other = ' ';
invalue $ae '0','1' = _same_
other = ' ';
data clean.patients_filtered;
infile "c:\books\clean\patients.txt" pad;
input @1 Patno $3.
@4 Gender $gen1.
@27 AE $ae1.;
label Patno = "Patient Number"
Gender = "Gender"
AE = "adverse event?";
title "Listing of data set PATIENTS_FILTERED";
proc print data=clean.patients_filtered;
var Patno Gender AE;
Notice the INVALUE statements in the PROC FORMAT above. The keyword _SAME_ is a SAS
reserved value that does what its name implies—it leaves any of the values listed in the range
specification unchanged. The keyword OTHER in the subsequent line refers to any values not
matching one of the previous ranges. Notice also that the informats in the INPUT statement use
the user-defined informat name followed by the number of columns to be read, the same method
that is used with predefined SAS informats.
20 Cody’s Data Cleaning Techniques Using SAS, Second Edition
Output from the PROC PRINT is shown next.
Listing of data set PATIENTS_FILTERED
Obs Patno Gender AE
1 001 M 0
2 002 F 0
3 003 1
4 004 F
5 XX5 M 0
6 006 1
7 007 M 0
8 M 0
9 008 F 0
10 009 M 1
11 010 0
12 011 M 1
13 012 M 0
14 013
15 014 M 1
16 002 F 0
17 003 M 0
18 015 F 1
19 017 F 0
20 019 M 0
21 123 M 0
22 321 F 1
23 020 F 0
24 022 M 1
25 023 0
26 024 F 0
27 025 M 1
28 027 F 0
29 028 F 0
30 029 M 1
31 006 F 0
Notice that invalid values for GENDER and AE are now missing values, including the two
lowercase 'f's (patient numbers 010 and 023).

Get Cody's Data Cleaning Techniques Using SAS, Second Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.