Statistical Data Cleaning with Applications in R

Book description

A comprehensive guide to automated statistical data cleaning 

The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. Statistical Data Cleaning brings together a wide range of techniques for cleaning textual, numeric or categorical data. This book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy.

Key features:

  • Focuses on the automation of data cleaning methods, including both theory and applications written in R.
  • Enables the reader to design data cleaning processes either for one-off analytical purposes or for production systems that clean data on a regular basis.
  • Explores statistical techniques for resolving issues such as incompleteness, contradictions and outliers, and covers the integration of data cleaning components and quality monitoring.
  • Supported by an accompanying website featuring data and R code.

This book enables data scientists and statistical analysts to deepen their understanding of data cleaning and to upgrade their practical data cleaning skills. It can also be used as material for a course in data cleaning and analysis.

Table of contents

  1. Cover
  2. Title Page
  3. Copyright
  4. Foreword
    1. What You Will Find in this Book
    2. For Who Is this Book?
    3. Acknowledgments
  5. About the Companion Website
  6. Chapter 1: Data Cleaning
    1. 1.1 The Statistical Value Chain
    2. 1.2 Notation and Conventions Used in this Book
  7. Chapter 2: A Brief Introduction to R
    1. 2.1 R on the Command Line
    2. 2.2 Vectors
    3. 2.3 Data Frames
    4. 2.4 Special Values
    5. 2.5 Getting Data into and out of R
    6. 2.6 Functions
    7. 2.7 Packages Used in this Book
  8. Chapter 3: Technical Representation of Data
    1. 3.1 Numeric Data
    2. 3.2 Text Data
    3. 3.3 Times and Dates
    4. 3.4 Notes on Locale Settings
  9. Chapter 4: Data Structure
    1. 4.1 Introduction
    2. 4.2 Tabular Data
    3. 4.3 Matrix Data
    4. 4.4 Time Series
    5. 4.5 Graph Data
    6. 4.6 Web Data
    7. 4.7 Other Data
    8. 4.8 Tidying Tabular Data
  10. Chapter 5: Cleaning Text Data
    1. 5.1 Character Normalization
    2. 5.2 Pattern Matching with Regular Expressions
    3. 5.3 Common String Processing Tasks in R
    4. 5.4 Approximate Text Matching
  11. Chapter 6: Data Validation
    1. 6.1 Introduction
    2. 6.2 A First Look at the validate Package
    3. 6.3 Defining Data Validation
    4. 6.4 A Formal Typology of Data Validation Functions
  12. Chapter 7: Localizing Errors in Data Records
    1. 7.1 Error Localization
    2. 7.2 Error Localization with R
    3. 7.3 Error Localization as MIP-Problem
    4. 7.4 Numerical Stability Issues
    5. 7.5 Practical Issues
    6. 7.6 Conclusion
    7. Appendix 7.A: Derivation of Eq. (7.33)
  13. Chapter 8: Rule Set Maintenance and Simplification
    1. 8.1 Quality of Validation Rules
    2. 8.2 Rules in the Language of Logic
    3. 8.3 Rule Set Issues
    4. 8.4 Detection and Simplification Procedure
    5. 8.5 Conclusion
  14. Chapter 9: Methods Based on Models for Domain Knowledge
    1. 9.1 Correction with Data Modifying Rules
    2. 9.2 Rule-Based Correction with dcmodify
    3. 9.3 Deductive Correction
  15. Chapter 10: Imputation and Adjustment
    1. 10.1 Missing Data
    2. 10.2 Model-Based Imputation
    3. 10.3 Model-Based Imputation in R
    4. 10.4 Donor Imputation with R
    5. 10.5 Other Methods in the simputation Package
    6. 10.6 Imputation Based on the EM Algorithm
    7. 10.7 Sampling Variance under Imputation
    8. 10.8 Multiple Imputations
    9. 10.9 Analytic Approaches to Estimate Variance of Imputation
    10. 10.10 Choosing an Imputation Method
    11. 10.11 Constraint Value Adjustment
  16. Chapter 11: Example: A Small Data-Cleaning System
    1. 11.1 Setup
    2. 11.2 Monitoring Changes in Data
    3. 11.3 Integration and Automation
  17. References
  18. Index
  19. End User License Agreement

Product information

  • Title: Statistical Data Cleaning with Applications in R
  • Author(s): Mark van der Loo, Edwin de Jonge
  • Release date: April 2018
  • Publisher(s): Wiley
  • ISBN: 9781118897157