18 Dealing with Missing Data

Most datasets contain missing values, and missing values make any modelling a lot more complicated. This section will help you to deal with them.

However, it is, as usual, better to avoid missing data than to solve the problem later on. In the first place, we need to be careful when collecting data. The person who collects the data is usually not the one reading this book. The data scientist reading this book can, however, raise awareness with higher management, who can then start measuring data quality and either improve awareness further or invest in better systems.

We can also change software so that it helps to collect correct data. If, for example, the retail staff systematically leave the field “birth date” empty, then the software can refuse entries where the birth date is left empty, pop up a warning if the customer is over 100 years old, and simply not accept birth dates that imply an age over 150. Management can also show numbers on the losses due to loans that were accepted on wrong information. The procedures can be adapted as well, for example to require a copy of the customer's ID card, and the audit department can then check compliance with this procedure, etc.
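The three input rules for the birth date field can be sketched in a few lines of base R. This is only a minimal illustration: the function name `validate_birth_date`, the `"YYYY-MM-DD"` input format, the returned list, and the approximation of a year as 365.25 days are all assumptions made for the example, not part of any real front-end system.

```r
# Hypothetical validator for a "birth date" input field.
# Assumption: the field arrives as a single character string "YYYY-MM-DD".
validate_birth_date <- function(birth_date, today = Sys.Date()) {
  # Small helper to build a uniform result (illustrative structure).
  result <- function(accepted, message = NA_character_) {
    list(accepted = accepted, message = message)
  }

  # Rule 1: refuse a field that was left empty.
  if (is.null(birth_date) || is.na(birth_date) || !nzchar(trimws(birth_date))) {
    return(result(FALSE, "birth date is required"))
  }

  # With an explicit format, as.Date() returns NA for unparseable input.
  dob <- as.Date(birth_date, format = "%Y-%m-%d")
  if (is.na(dob) || dob > today) {
    return(result(FALSE, "birth date is not a valid past date"))
  }

  # Approximate age in years (365.25 days per year is a simplification).
  age <- as.numeric(today - dob) / 365.25

  # Rule 3: do not accept birth dates that imply an age over 150.
  if (age > 150) {
    return(result(FALSE, "implied age is over 150 years"))
  }

  # Rule 2: accept, but warn, if the customer is over 100 years old.
  if (age > 100) {
    return(result(TRUE, "customer is over 100 years old, please double-check"))
  }

  result(TRUE)
}

# Example calls with a fixed reference date so the outcome is reproducible.
ref <- as.Date("2024-01-01")
validate_birth_date("", today = ref)$accepted            # empty field
validate_birth_date("1920-01-01", today = ref)$message   # warning case
validate_birth_date("1850-01-01", today = ref)$accepted  # implausible age
```

Returning a message together with the accept/reject flag lets the calling software decide whether to block the entry or only show a warning to the staff member.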

However, there will still be cases where some data is missing. Even if the data quality is generally fine, and we still have thousands of observations left after leaving out the small percentage of missing data, it remains essential to find out why the data ...
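As a rough first step, one can measure how much is missing per variable, how many observations remain after dropping incomplete rows, and whether the missing values cluster in a particular group. The sketch below uses base R only; the data frame `customers` and its columns (including `branch`) are invented purely for illustration.

```r
# Toy data with a few missing values (made up for this example).
customers <- data.frame(
  id         = 1:6,
  branch     = c("A", "B", "A", "B", "A", "A"),
  birth_date = as.Date(c("1980-02-01", NA, "1975-11-23",
                         NA, "1990-07-04", "1968-03-15")),
  income     = c(32000, 41000, NA, 28000, 55000, 47000)
)

# Share of missing values per column.
missing_share <- colMeans(is.na(customers))

# Observations that remain after leaving out rows with any missing value.
complete_rows <- customers[complete.cases(customers), ]
n_dropped     <- nrow(customers) - nrow(complete_rows)

# Cross-tabulate missing birth dates against the (hypothetical) branch
# to see whether the gaps are concentrated in one place.
missing_by_branch <- table(
  branch             = customers$branch,
  birth_date_missing = is.na(customers$birth_date)
)

print(missing_share)
print(n_dropped)
print(missing_by_branch)
```

If the missing values concentrate in one group, as they do for branch B in this toy data, simply dropping the incomplete rows would remove that group disproportionately, which is one reason to look at the pattern before deleting anything.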
