Chapter 3: Managing Data

In Chapter 1, we discussed some of the foundational principles behind machine learning. We followed that discussion in Chapter 2 with an introduction to both the R programming language and the RStudio development environment. In this chapter, we explain how to use R to manage our data prior to modeling. A machine learning model is only as good as the data used to build it. Quite often, this data is not easily accessible, is in the wrong format, or is hard to understand. As a result, it is critically important that, before building a model, we spend as much time as needed to collect the data we need, explore and understand the data we have, and prepare it so that it is useful for the selected machine learning approach. Typically, a large percentage of the time we spend in machine learning is, or should be, spent managing data.

By the end of this chapter, you will have learned the following:

  • What the tidyverse is and how to use it to manage data in R
  • How to collect data using R and some of the key things to consider when collecting data
  • Different approaches to describe and visualize data in R
  • How to clean, transform, and reduce data to make it more useful for the machine learning process

THE TIDYVERSE

The tidyverse is a collection of R packages designed to facilitate the entire analytics process by offering a standardized format for exchanging data ...
