Challenges and Methods
In this part of the book Data Science for Software Engineering: Sharing Data and Models, we show that sharing all data is less useful than sharing just the relevant data. Several methods can find those relevant data regions, including simple nearest neighbor (kNN) algorithms; clustering (to optimize subsequent kNN); and pruning away “bad” regions. We also show that, with clustering, it is possible to repair missing data in project records.
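As a rough illustration of the nearest neighbor idea mentioned above, the following minimal sketch selects the k rows of a shared data pool that lie closest to a local query project. The function names, the toy data, and the choice of unweighted Euclidean distance are all assumptions made for this example; they are not taken from the book.

```python
import math

def euclidean(a, b):
    # Unweighted Euclidean distance between two equal-length numeric rows
    # (an assumed distance measure, chosen only for illustration).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relevant_rows(pool, query, k=3):
    # Hypothetical helper: return the k rows of `pool` nearest to `query`,
    # ordered from closest to farthest.
    return sorted(pool, key=lambda row: euclidean(row, query))[:k]

# Toy pool of project records: three rows near (1, 1), two far away.
pool = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (0.9, 1.1), (6.0, 5.5)]
local = relevant_rows(pool, query=(1.0, 1.0), k=3)
```

Here `local` keeps only the three rows near the query and drops the two distant ones, which is the sense in which sharing just the relevant data can replace sharing everything.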
This book now turns to the complex issue of sharing data and models. In this part, we discuss sharing data; the next part, starting on page 235, discusses sharing models.
Until very recently, there was much pessimism about ...