Detecting Duplicates by Using DATA Step Approaches
Let’s explore ways to detect duplicate IDs and duplicate observations in a data set. One very good approach is to use the temporary SAS variables FIRST. and LAST., which SAS creates for each variable named in a BY statement. To see how this works, look at Program 5-4, which prints all observations that have duplicate patient numbers.
Program 5-4. Identifying Duplicate IDs
PROC SORT DATA=CLEAN.PATIENTS OUT=TMP;           /* 1 */
   BY PATNO;
RUN;

DATA DUP;
   SET TMP;
   BY PATNO;                                     /* 2 */
   IF FIRST.PATNO AND LAST.PATNO THEN DELETE;    /* 3 */
RUN;

PROC PRINT DATA=DUP;
   TITLE "Listing of Duplicates from Data Set CLEAN.PATIENTS";
   ID PATNO;
RUN;
You must first sort the data set by the ID variable (1). In the above program, the original data ...
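To make the FIRST. and LAST. logic easier to see, here is a minimal sketch that copies both temporary variables into ordinary variables so they can be printed. The data set SAMPLE, its values, and the variables VISIT, IS_FIRST, IS_LAST, and UNIQUE are hypothetical, invented only for this illustration; they are not part of CLEAN.PATIENTS.

```sas
/* Hypothetical sample data; these patient numbers are invented */
DATA SAMPLE;
   INPUT PATNO $ VISIT;
DATALINES;
001 1
002 1
002 2
003 1
003 2
003 3
;
RUN;

/* BY-group processing requires the data to be sorted by PATNO */
PROC SORT DATA=SAMPLE;
   BY PATNO;
RUN;

DATA FLAGS;
   SET SAMPLE;
   BY PATNO;
   /* FIRST. and LAST. variables are temporary and are not written */
   /* to the output data set, so copy them to regular variables    */
   IS_FIRST = FIRST.PATNO;
   IS_LAST  = LAST.PATNO;
   /* A PATNO is unique only when one row is both first and last */
   IF IS_FIRST AND IS_LAST THEN UNIQUE = 1;
   ELSE UNIQUE = 0;
RUN;

PROC PRINT DATA=FLAGS;
   TITLE "FIRST. and LAST. Values for Each Observation";
   ID PATNO;
RUN;
```

In this sketch, the single observation for patient 001 is both the first and the last in its BY group, so it is flagged as unique. Every observation for patients 002 and 003 fails that test, which is the same condition Program 5-4 uses to decide which observations to keep in DUP.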