Cody's Data Cleaning Techniques Using SAS, Second Edition

Book description

Thoroughly updated for SAS 9, Cody's Data Cleaning Techniques Using SAS, Second Edition, addresses tasks that nearly every SAS programmer needs to do - that is, make sure that data errors are located and corrected. Written in Ron Cody's signature informal, tutorial style, this book develops and demonstrates data cleaning programs and macros that you can use as written or modify for your own special data cleaning needs. Each topic is developed through specific examples, and every program and macro is explained in detail.

Table of contents

  1. Table of Contents (1/4)
  2. Table of Contents (2/4)
  3. Table of Contents (3/4)
  4. Table of Contents (4/4)
  5. Checking Values of Character Variables
    1. Introduction
    2. Using PROC FREQ to List Values
      1. Description of the Raw Data File PATIENTS.TXT
    3. Using a DATA Step to Check for Invalid Values
      1. Describing the VERIFY, TRIM, MISSING, and NOTDIGIT Functions
    4. Using PROC PRINT with a WHERE Statement to List
    5. Invalid Values
    6. Using Formats to Check for Invalid Values
    7. Using Informats to Remove Invalid Values
  6. Checking Values of Numeric Variables
    1. Introduction
    2. Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers (1/2)
    3. Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers (2/2)
    4. Using an ODS SELECT Statement to List Extreme Values
    5. Using PROC UNIVARIATE Options to List More Extreme Observations
    6. Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage (1/2)
    7. Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage (2/2)
    8. Using PROC RANK to Look for Highest and Lowest Values by Percentage
    9. Presenting a Program to List the Highest and Lowest Ten Values
    10. Presenting a Macro to List the Highest and Lowest "n" Values
    11. Using PROC PRINT with a WHERE Statement to List Invalid Data Values
    12. Using a DATA Step to Check for Out-of-Range Values
    13. Identifying Invalid Values versus Missing Values
    14. Listing Invalid (Character) Values in the Error Report
    15. Creating a Macro for Range Checking
    16. Checking Ranges for Several Variables
    17. Using Formats to Check for Invalid Values
    18. Using Informats to Filter Invalid Values
    19. Checking a Range Using an Algorithm Based on Standard Deviation
    20. Detecting Outliers Based on a Trimmed Mean and Standard Deviation
    21. Presenting a Macro Based on Trimmed Statistics
    22. Using the TRIM Option of PROC UNIVARIATE and ODS to Compute Trimmed Statistics (1/2)
    23. Using the TRIM Option of PROC UNIVARIATE and ODS to Compute Trimmed Statistics (2/2)
    24. Checking a Range Based on the Interquartile Range
    25. Summary
  7. Checking for Missing Values
    1. Introduction
    2. Inspecting the SAS Log
    3. Using PROC MEANS and PROC FREQ to Count Missing Values
    4. Using DATA Step Approaches to Identify and Count Missing Values
    5. Searching for a Specific Numeric Value
    6. Creating a Macro to Search for Specific Numeric Values
  8. Working with Dates
    1. Introduction
    2. Checking Ranges for Dates (Using a DATA Step)
    3. Checking Ranges for Dates (Using PROC PRINT)
    4. Checking for Invalid Dates
    5. Working with Dates in Nonstandard Form
    6. Creating a SAS Date When the Day of the Month Is
    7. Missing
    8. Suspending Error Checking for Known Invalid Dates
  9. Looking for Duplicates and “n” Observations per Subject
    1. Introduction
    2. Eliminating Duplicates by Using PROC SORT
    3. Detecting Duplicates by Using DATA Step Approaches
    4. Using PROC FREQ to Detect Duplicate ID's
    5. Selecting Patients with Duplicate Observations by Using a
    6. Macro List and SQL
    7. Identifying Subjects with "n" Observations Each (DATA Step
    8. Approach)
    9. Identifying Subjects with "n" Observations Each (Using
    10. PROC FREQ)
  10. Working with Multiple Files
    1. Introduction
    2. Checking for an ID in Each of Two Files
    3. Checking for an ID in Each of "n" Files
    4. A Macro for ID Checking
    5. More Complicated Multi-File Rules
    6. Checking That the Dates Are in the Proper Order
  11. Double Entry and Verification ( PROC COMPARE)
    1. Introduction
    2. Conducting a Simple Comparison of Two Data Sets (1/2)
    3. Conducting a Simple Comparison of Two Data Sets (2/2)
    4. Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
    5. Comparing Two Data Sets When Some Variables Are Not in Both Data Sets
  12. Some PROC SQL Solutions to Data Cleaning
    1. Introduction
    2. A Quick Review of PROC SQL
    3. Checking for Invalid Character Values
    4. Checking for Outliers
    5. Checking a Range Using an Algorithm Based on the Standard Deviation
    6. Checking for Missing Values
    7. Range Checking for Dates
    8. Checking for Duplicates
    9. Identifying Subjects with "n" Observations Each
    10. Checking for an ID in Each of Two Files
    11. More Complicated Multi-File Rules
  13. Correcting Errors
    1. Introduction
    2. Hardcoding Corrections
    3. Describing Named Input
    4. Reviewing the UPDATE Statement
  14. Creating Integrity Constraints and Audit Trails
    1. Introducing SAS Integrity Constraints
    2. Demonstrating General Integrity Constraints
    3. Deleting an Integrity Constraint Using PROC DATASETS
    4. Creating an Audit Trail Data Set
    5. Demonstrating an Integrity Constraint Involving More than One Variable
    6. Demonstrating a Referential Constraint
    7. Attempting to Delete a Primary Key When a Foreign Key Still Exists
    8. Attempting to Add a Name to the Child Data Set
    9. Demonstrating the Cascade Feature of a Referential Constraint
    10. Demonstrating the SET NULL Feature of a Referential Constraint
    11. Demonstrating How to Delete a Referential Constraint
  15. DataFlux and dfPower Studio
    1. Introduction
    2. Examples
  16. Listing of Raw Data Files and SAS Programs
    1. Programs and Raw Data Files Used in This Book
    2. Description of the Raw Data File PATIENTS. TXT
    3. Layout for the Data File PATIENTS. TXT
    4. Listing of Raw Data File PATIENTS. TXT
    5. Program to Create the SAS Data Set PATIENTS
    6. Listing of Raw Data File PATIENTS2. TXT
    7. Program to Create the SAS Data Set PATIENTS2
    8. Program to Create the SAS Data Set AE
    9. ( Adverse Events)
    10. Program to Create the SAS Data Set LAB_ TEST
    11. Listing of the Data Cleaning Macros Used in
    12. This Book
      1. Creating a Macro to List the Highest and Lowest " n" Percent of the Data Using PROC UNIVARIATE
      2. Creating a Macro to List the Highest and Lowest " n" Percent of the Data Using PROC RANK
      3. Creating a Macro to List the Highest and Lowest " n" Values
      4. Creating a Macro to List Out- of- Range Data Values
      5. Writing a Program to Summarize Data Errors on Several Variables
      6. Creating a Macro to Detect Outliers Based on Trimmed Statistics
      7. Creating a Macro to List Outliers of Several Variables Based on Trimmed Statistics ( Using PROC UNIVARIATE)
      8. Detecting Outliers Based on the Interquartile Range
      9. Creating a Macro to Search for Specific Numeric Values
      10. Creating a Macro to Check for ID's Across Multiple Data Sets
  17. Index (1/3)
  18. Index (2/3)
  19. Index (3/3)

Product information

  • Title: Cody's Data Cleaning Techniques Using SAS, Second Edition
  • Author(s): Ron Cody
  • Release date: April 2015
  • Publisher(s): SAS Institute
  • ISBN: 9781599948324