Once we had evidence of bug concentration, we were able to start looking for characteristics of the buggy code that would allow us to identify it. The software in an ongoing project provides two classes of properties that can potentially be used to characterize particular code units. The first class consists of static structural code properties that can be extracted directly from the source code. They include the programming language, the number of lines of code in a file, the number of method calls, and various software complexity metrics.
The second class consists of process properties, which relate to the history of the system’s development and testing. They include information such as the number of changes made and faults detected in previous releases, and the length of time that a particular code unit has been part of the system.
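To make the two classes of properties concrete, a per-file record might look like the following minimal sketch. The field names, the `FileProperties` type, and the helper functions are hypothetical and chosen only for illustration; counting non-blank lines is one crude way to measure size, not necessarily the measure used in the studies described here.

```python
from dataclasses import dataclass


@dataclass
class FileProperties:
    # Static structural properties: derivable from the source text alone.
    language: str
    lines_of_code: int
    # Process properties: must come from project history (hypothetical fields).
    prior_changes: int
    prior_faults: int
    releases_in_system: int


def count_lines_of_code(source: str) -> int:
    """Count non-blank lines, a deliberately simple size measure."""
    return sum(1 for line in source.splitlines() if line.strip())


def describe_file(source: str, language: str, prior_changes: int,
                  prior_faults: int, releases_in_system: int) -> FileProperties:
    """Combine a static measurement with history supplied by the caller."""
    return FileProperties(
        language=language,
        lines_of_code=count_lines_of_code(source),
        prior_changes=prior_changes,
        prior_faults=prior_faults,
        releases_in_system=releases_in_system,
    )
```

The split matters in practice: the static fields can be recomputed at any time from a checkout of the code, while the process fields require access to change and fault records from earlier releases.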
The goal of our early research was to find properties of the files from both of these classes that had a strong correlation with the occurrence of faults in the files. The first two systems that we studied provided enough evidence for us to build preliminary models that did a creditable job of predicting which files would be the most likely to have the largest number of faults in future releases of those systems.
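As a rough illustration of screening properties by their correlation with fault counts, the sketch below computes a Pearson correlation for each property and ranks the properties by strength. It is an assumption-laden example, not the modeling approach used for these systems: the function names are hypothetical, the numbers are invented, and a simple correlation is only a first look, not a prediction model.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_properties(properties, faults):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scored = [(name, pearson(values, faults)) for name, values in properties.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Invented values for four hypothetical files; not data from any studied system.
properties = {
    "lines_of_code": [100, 200, 300, 400],
    "prior_changes": [3, 1, 4, 2],
}
faults = [1, 2, 3, 4]
ranking = rank_properties(properties, faults)
```

With these made-up inputs, `lines_of_code` ranks first because it rises in step with the fault counts, while `prior_changes` shows no linear relationship. Real project data would be far noisier, and a property that ranks well in one system would still need to be checked against others.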
The third system we examined, the Voice Response system, was similar to the first two systems in size and in length of time in the field, and it also had a multideveloper team, but it used a development process that ...