Data understanding and preparation

The dataset consists of tissue samples from 699 patients, stored in a data frame with 11 variables:

  • ID: Sample code number
  • V1: Clump thickness
  • V2: Uniformity of the cell size
  • V3: Uniformity of the cell shape
  • V4: Marginal adhesion
  • V5: Single epithelial cell size
  • V6: Bare nuclei (16 values are missing)
  • V7: Bland chromatin
  • V8: Normal nucleoli
  • V9: Mitoses
  • class: Whether the tumor diagnosis is benign or malignant; this will be the outcome that we are trying to predict

The medical team has scored and coded each of the nine features on a scale of 1 to 10.

The data frame is available in the R MASS package under the name biopsy. To prepare this data, we will load the data frame, confirm the structure, ...
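
As a minimal sketch of the first two steps named above (loading the data frame and confirming its structure), something like the following should work. It assumes the MASS package is installed; the variable names missing_per_column and feature_ranges are illustrative only and do not come from the original text. The remaining preparation steps are not shown here.

```r
# Load the MASS package and the biopsy data frame it provides
library(MASS)
data(biopsy)

# Confirm the structure: 699 observations of 11 variables
# (ID, the nine scored features V1 to V9, and the class outcome)
str(biopsy)

# Count missing values per column; only V6 should report any
missing_per_column <- colSums(is.na(biopsy))
print(missing_per_column)

# Check that the nine features stay within the 1 to 10 scoring scale
feature_ranges <- sapply(biopsy[, paste0("V", 1:9)], range, na.rm = TRUE)
print(feature_ranges)

# Tabulate the outcome we are trying to predict
table(biopsy$class)
```

Checking the missing-value counts and feature ranges at this stage is a quick way to confirm that the data match the description above before any further preparation.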
