16. Interpreting a Topological Measure of Complexity for Decision Boundaries

We propose a method for examining the decision boundaries of classification algorithms to gain insight into the nature of overfitting. In machine learning, models are commonly evaluated with one of two techniques: a train–test split or cross-validation. In this chapter, we expand this toolkit to include tools from the field of topological data analysis. In particular, we use persistent homology, which roughly characterizes the shape of a data set.

Our method focuses on binary classification, using training data to sample points on the decision boundary of the feature space. We then calculate the persistent homology of this sample and compute metrics to quantify the complexity of the decision boundary. Our experiments with data sets in various dimensions suggest that in certain cases, our measures of complexity are correlated with a model’s ability to generalize to unseen data. We hope that refining this method will lead to a better understanding of overfitting and a means to compare models.
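The pipeline described above can be illustrated with a small sketch. It is not the chapter's implementation: the classifier (`classify`, a unit circle in the plane), the bisection-based boundary sampler (`boundary_point`) and the summary metric (total 0-dimensional persistence) are all assumptions chosen for illustration, and only 0-dimensional persistent homology of a Vietoris–Rips filtration is computed, using a union-find pass over sorted pairwise distances.

```python
import itertools
import math
import random


def classify(point):
    """Hypothetical binary classifier: label 1 inside the unit circle, else 0."""
    x, y = point
    return 1 if x * x + y * y < 1.0 else 0


def boundary_point(p, q, predict, steps=30):
    """Bisect the segment from p to q (which must receive different labels)
    to locate a point close to the decision boundary."""
    label_p = predict(p)
    for _ in range(steps):
        mid = tuple((a + b) / 2 for a, b in zip(p, q))
        if predict(mid) == label_p:
            p = mid  # mid is on p's side; move p towards the boundary
        else:
            q = mid  # mid is on q's side; move q towards the boundary
    return tuple((a + b) / 2 for a, b in zip(p, q))


def h0_persistence(points):
    """Finite death times of 0-dimensional persistent homology for a
    Vietoris-Rips filtration. Every component is born at scale 0 and dies
    when an edge first merges it into another component (Kruskal's algorithm)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


# Synthetic "training data" (an assumption; any labelled data set would do).
rng = random.Random(0)
train = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(60)]
inside = [p for p in train if classify(p) == 1]
outside = [p for p in train if classify(p) == 0]

# Pair points with opposite predicted labels and bisect towards the boundary.
pairs = [(p, q) for p in inside for q in outside]
sample = [boundary_point(p, q, classify) for p, q in rng.sample(pairs, 40)]

# One possible complexity summary: the sum of finite H0 lifetimes.
deaths = h0_persistence(sample)
total_persistence = sum(deaths)
print(f"{len(sample)} boundary points, total H0 persistence = {total_persistence:.3f}")
```

Detecting loops or voids in the boundary would require higher-dimensional persistent homology, which in practice is computed with a dedicated TDA library rather than by hand.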

16.1. Introduction

In this chapter, we introduce and investigate the use of a toolkit known as topological data analysis (TDA) to better understand classification algorithms by inspecting their decision boundaries, which can reveal signs of overfitting, among other information. This toolkit is motivated by the notion that data intrinsically has a shape, and that this shape can be recovered computationally; in this chapter, ...
