3
Advanced Topics in Initial Exploration and Dataset Preparation Using VisMiner
Chapter 2 introduced most of the data visualization viewers as part of an initial exploration. In that chapter, the correlation matrix and the parallel plot were also used to create data subsets. The correlation matrix allowed us to project attributes (dimension reduction) from a dataset, while the parallel plot allowed us to both project attributes and filter observations.
This chapter introduces the location plot viewer, but it primarily presents additional functionality for dataset preparation. Specifically, we use VisMiner to:
- handle missing values
- create computed columns
- aggregate observations
- merge datasets
- detect and eliminate outliers.
Missing Values
When working with “real world” data, missing values are a common problem. Most analysis algorithms require a complete dataset in order to run, and VisMiner is no exception: all missing values in a dataset must be handled before VisMiner can generate visualizations of it or apply data mining algorithms to it.
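Before choosing how to handle missing values, it helps to know how many there are and where they occur. The following is a minimal sketch in plain Python, not part of VisMiner, that counts missing values per column and counts the observations that contain at least one. The names `MISSING_MARKERS`, `is_missing`, and `missing_report`, the choice of markers, and the sample rows are all illustrative assumptions; a real dataset may encode missing values differently.

```python
import math

# Assumed markers for "missing"; adjust to match how the dataset encodes gaps.
MISSING_MARKERS = {"", "NA", "?"}


def is_missing(value):
    """Return True if a cell should be treated as a missing value."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_MARKERS:
        return True
    return False


def missing_report(rows):
    """Count missing values per column and the number of incomplete rows.

    `rows` is a list of dicts, one dict per observation.
    """
    per_column = {}
    incomplete_rows = 0
    for row in rows:
        row_has_missing = False
        for column, value in row.items():
            per_column.setdefault(column, 0)
            if is_missing(value):
                per_column[column] += 1
                row_has_missing = True
        if row_has_missing:
            incomplete_rows += 1
    return per_column, incomplete_rows


# Made-up sample observations, for illustration only.
sample = [
    {"age": 34, "income": 52000, "city": "Provo"},
    {"age": None, "income": 61000, "city": "Ogden"},
    {"age": 29, "income": "", "city": "?"},
]

per_column, incomplete_rows = missing_report(sample)
print(per_column)
print(incomplete_rows, "of", len(sample), "observations are incomplete")
```

Comparing the per-column counts with the number of incomplete observations gives a rough sense of whether the gaps are spread thinly across the dataset or concentrated in a few columns.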
Missing values are typically handled in one of five different ways.
- Eliminate any observations with missing data from the dataset. This is usually an acceptable solution when only a small fraction of the observations in the dataset contain missing values.
- Keep the observations, but drop the column with missing values. This option may be acceptable when most of the missing ...