Data understanding and preparation

The data for the 532 women is split across two separate data frames. The variables of interest are as follows:

  • npreg: This is the number of pregnancies
  • glu: This is the plasma glucose concentration in an oral glucose tolerance test
  • bp: This is the diastolic blood pressure (mm Hg)
  • skin: This is the triceps skin-fold thickness (mm)
  • bmi: This is the body mass index
  • ped: This is the diabetes pedigree function
  • age: This is the age in years
  • type: This indicates whether the woman is diabetic, Yes or No

Both data frames are contained in the R package MASS. One is named Pima.tr and the other is named Pima.te. Instead of using these as separate train and test sets, we will combine them and create our own in order to discover how to ...
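
As a minimal sketch of the combination step described above, the two data frames can be stacked row-wise with rbind(), since they share the same eight columns. The object names pima, train_idx, train and test are illustrative, and the 70/30 split ratio and the seed are assumptions chosen only for this example, not values taken from the text.

```r
# MASS is a recommended package that ships with standard R installations
library(MASS)

data(Pima.tr)
data(Pima.te)

# Stack the two data frames row-wise; both have the same eight columns
pima <- rbind(Pima.tr, Pima.te)

str(pima)          # variable types for the combined data
table(pima$type)   # counts of the Yes/No response

# Hypothetical random split: the seed and the 70/30 ratio are
# illustrative assumptions, not values stated in the text
set.seed(123)
train_idx <- sample(nrow(pima), size = floor(0.7 * nrow(pima)))
train <- pima[train_idx, ]
test  <- pima[-train_idx, ]
```

Checking dim(pima) after the rbind() call is a quick way to confirm that all 532 observations and 8 variables are present before any modeling begins.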
