Statistical Analysis

The ArchLinux repositories contained 4,069 packages (as of April 2010), some of which were different versions of the same upstream project. After removing the duplicate versions, we obtained a sample of 4,015 packages containing 1,272,748 source code files, of which 576,511 were written in C. Many of those files were repeated, however: only 776,573 files in the overall sample were unique, and only 338,831 in the C subsample. Of these unique C files, 212,167 were nonheader files and 126,664 were header files.
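The text does not say how repeated files were detected. As a minimal sketch, assuming that two files count as repeated when their bytes are identical and that header files are those with a .h extension, the counting could look like the following. The function names and the choice of SHA-256 are illustrative assumptions, not the procedure used in the study.

```python
import hashlib
from pathlib import Path


def content_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def unique_files(root: Path) -> dict[str, Path]:
    """Map each distinct content digest to the first file found with it."""
    seen: dict[str, Path] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # setdefault keeps the first path and ignores later duplicates.
            seen.setdefault(content_digest(path), path)
    return seen


def split_c_files(paths):
    """Split paths into (nonheader, header) lists by .c / .h extension.

    Assumption: the extension alone decides the category.
    """
    paths = list(paths)
    nonheader = [p for p in paths if p.suffix == ".c"]
    header = [p for p in paths if p.suffix == ".h"]
    return nonheader, header
```

Calling `unique_files(Path("some/source/tree"))` (a hypothetical path) and passing its values to `split_c_files` would yield counts analogous to the unique, nonheader, and header figures reported above.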

The measurements shown in the previous section for the file in Example 8-2 were repeated in our study for all the files written in C. Thus, we ended up with a set of more than 300,000 measurements. Each element of the set was a tuple holding the nine metric values for one file.

Overall Analysis

The basic analysis of the sample correlated each of the nine file-level metrics with all the others. The goal was to extract a set of orthogonal metrics that can characterize software size and complexity; that is, to discover which of the metrics we gathered provide no further information and can therefore be discarded.
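The idea of correlating every metric with every other one, and discarding those that add no information, can be sketched as follows. This is only an illustration under stated assumptions: it uses the Pearson coefficient, a cutoff of 0.9, and a greedy keep-or-drop rule, none of which are specified in the text, and the metric names in the usage note are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every metric pair.

    `columns` maps a metric name to its list of per-file values.
    """
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def redundant_metrics(columns, threshold=0.9):
    """Greedily split metrics into (kept, dropped).

    A metric is dropped when its absolute correlation with any metric
    already kept reaches the threshold (an assumed cutoff).
    """
    kept, dropped = [], []
    for name in columns:
        if any(abs(pearson(columns[name], columns[k])) >= threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped
```

For example, with toy data such as `{"sloc": [10, 20, 30, 40, 50], "loc": [12, 24, 36, 48, 60], "other": [5, 1, 4, 2, 3]}`, `redundant_metrics` keeps `sloc` and `other` and drops `loc`, because `loc` is a linear function of `sloc`. The greedy rule depends on the order in which metrics are listed; a real analysis would inspect the full matrix rather than rely on a single pass.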

For the analysis, we considered each file of the sample as an independent point, which is a fair assumption because we discarded all the repeated files, and because all files came from projects that can be considered independent. For ...
