Chapter 2

Data Cleaning and Pre-processing

Learning Objectives

By the end of this chapter, you will be able to:

  • Perform the sort, rank, filter, subset, normalize, scale, and join operations in an R data frame.
  • Identify and handle outliers, missing values, and duplicates gracefully using the MICE and rpart packages.
  • Perform undersampling and oversampling on a dataset.
  • Apply the concepts of ROSE and SMOTE to handle unbalanced data.

This chapter covers the key concepts of handling data and preparing it for analysis.
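As a rough preview of the data frame operations listed in the objectives, the following is a minimal sketch in base R. The data frames `customers` and `regions` and their columns are hypothetical, invented purely for illustration, and the mean imputation shown is a simple stand-in rather than the MICE or rpart approach the chapter goes on to cover.

```r
# Hypothetical toy data; names and values are illustrative assumptions.
customers <- data.frame(
  id    = c(1, 2, 3, 3, 4),
  age   = c(34, NA, 51, 51, 27),
  spend = c(120, 80, 300, 300, 45)
)
regions <- data.frame(
  id     = c(1, 2, 3, 4),
  region = c("N", "S", "E", "W")
)

# Duplicates: drop rows that exactly repeat an earlier row.
deduped <- customers[!duplicated(customers), ]

# Missing values: replace NA ages with the mean of the observed ages
# (a simple stand-in, not multiple imputation).
deduped$age[is.na(deduped$age)] <- mean(deduped$age, na.rm = TRUE)

# Sort: order rows by spend, largest first.
sorted <- deduped[order(-deduped$spend), ]

# Rank: 1 = highest spend.
deduped$spend_rank <- rank(-deduped$spend)

# Filter / subset: keep rows with spend above 50.
high <- subset(sorted, spend > 50)

# Normalize: min-max rescaling of spend to the interval [0, 1].
rng <- range(deduped$spend)
deduped$spend_norm <- (deduped$spend - rng[1]) / (rng[2] - rng[1])

# Scale: z-score standardization (mean 0, standard deviation 1).
deduped$spend_z <- as.numeric(scale(deduped$spend))

# Join: inner join with the region lookup table on the shared id column.
joined <- merge(deduped, regions, by = "id")
```

Base R is used here only to keep the sketch free of package dependencies; the same steps could equally be written with other tools.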

Introduction

Data cleaning and preparation takes about 70% of the effort in a machine learning project. This step is essential because the quality of the data determines the accuracy of the ...
