9.2. Framing the Problem and Collecting the Data: The Wisconsin Breast Cancer Diagnostic Data Set

The Wisconsin Breast Cancer Diagnostic Data Set arises in connection with diagnosing breast tumors based on a fine needle aspirate. In this procedure, a small-gauge needle is used to remove fluid directly from the lump or mass. The fluid is placed on a glass slide and stained to reveal the nuclei of the cells. An imaging system is then used to determine the boundaries of the nuclei. A typical image contains 10 to 40 nuclei. The associated software computes ten characteristics for each nucleus: radius, perimeter, area, texture, smoothness, compactness, number of concave regions, size of concavities (a concavity is an indentation in the cell nucleus), symmetry, and fractal dimension of the boundary (a measure of regularity of the contour). Values of the last seven characteristics were computed in such a way that larger values correspond to more irregular cells.

A set of 569 lumps with known diagnoses (malignant or benign) was sampled, and each resulting image was processed as described above. Since a typical image can contain from 10 to 40 nuclei, the per-nucleus measurements were summarized for each image. For each of the ten characteristics, the mean, max, and standard error of the mean were computed, resulting in 30 variables (10 characteristics times 3 summary statistics). The model developed by the researchers was based on separating hyperplanes. A best model was chosen by applying cross-validation to estimate prediction accuracy, using all 569 records as a ...
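The summarization step described above can be illustrated with a short Python sketch. It is not the researchers' software: the function name summarize_image, the characteristic labels, the output column names, and the sample measurements are all hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei.

```python
import math
import statistics

# Hypothetical labels for the ten per-nucleus characteristics named in the text.
CHARACTERISTICS = [
    "radius", "perimeter", "area", "texture", "smoothness",
    "compactness", "concave_regions", "concavity", "symmetry",
    "fractal_dimension",
]


def summarize_image(nuclei):
    """Collapse per-nucleus measurements into one record for the image.

    `nuclei` is a list of dicts, one per nucleus, mapping each
    characteristic label to its measured value.
    """
    record = {}
    for name in CHARACTERISTICS:
        values = [nucleus[name] for nucleus in nuclei]
        record[f"mean_{name}"] = statistics.mean(values)
        record[f"max_{name}"] = max(values)
        # Assumed definition: sample standard deviation / sqrt(n).
        record[f"se_{name}"] = statistics.stdev(values) / math.sqrt(len(values))
    return record


# Made-up measurements for an image with three nuclei (illustration only).
example_nuclei = [
    {name: 1.0 + i for name in CHARACTERISTICS} for i in range(3)
]
summary = summarize_image(example_nuclei)
print(len(summary))  # 10 characteristics x 3 statistics
```

Each image yields one record with 30 entries, which matches the 30 variables described in the text; applying the same function to all 569 images would give one row per lump.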

