To date, we have not realized our dream of evidence that is elegant, statistically sound, and replicable. And where we have found evidence in each of these categories, it has not always had the impact we hoped for. Perhaps we need to revisit our definitions of “convincing evidence.” Given the feedback in the previous section, is there a more practical definition of convincing evidence that should be motivating researchers?
A more feasible (if humble) definition is this: convincing evidence motivates change. We suspect that many of the authors in this book began their hunt for convincing evidence when they saw firsthand the problems and difficulties in real-world software development. The Holy Grail for such researchers tends to be the research results that can create real-world improvement.
To motivate change, some influential audience has to trust the evidence. One way to deal with the lack of rigor in experience reports or case studies might be to assign a “confidence rating” to each piece of evidence. The rating would attempt to reflect where each report fits on a spectrum that ranges from anecdotal or problematic evidence to very trustworthy evidence. Such evaluations are an essential part of the process proposed by Kitchenham for systematic reviews and aggregations of software engineering studies [Kitchenham 2004]. A simple scale aimed at helping to communicate confidence levels to practitioners can be found in [Feldmann et al. 2006].
Generating truly convincing bodies of evidence would also require a shift in the way results are disseminated. It may well be that scientific publications are not the only technology required to convey evidence. Among other problems, these publications usually don’t make raw data available for other researchers to reanalyze, reinterpret, and verify. Also, although the authors of such publications try very hard to be exhaustive in their description of context, it is almost impossible to list every possible factor that could be relevant.
What may hold more promise is the creation of software engineering data repositories that contain results across different research environments (or at least across projects) and from which analyses can be undertaken. For example:
- The University of Nebraska’s Software-artifact Infrastructure Research (SIR) site
This repository stores software-related artifacts that researchers can use in rigorous controlled experimentation with program analysis and software testing techniques, and that educators can use to train students in controlled experimentation. The repository contains many Java and C software systems, in multiple versions, together with supporting artifacts such as test suites, fault data, and scripts. The artifacts in this repository have been used in hundreds of publications.
- The NASA Software Engineering Laboratory (SEL)
The NASA laboratory (described in Chapter 5) was a great success; its impact was large, and its data and results continue to be cited. One important lesson concerned its context: the leaders of the laboratory required researchers who wanted to use their data to spend time in the lab, so that they could understand the context correctly and not misinterpret results from their analyses.
This was an NSF-funded effort to create repositories of data and lessons learned that could be shared and reanalyzed by other researchers. We leveraged experiences from the SEL to explore ways of contextualizing the data with less overhead, such as by tagging all data sets with a rich set of metadata that described from where the data was drawn. These experiences in turn provided the underpinnings for a “lessons learned” repository maintained by the U.S. Defense Department’s Defense Acquisition University, which allows end users to specify their own context according to variables such as size of the project, criticality, or domain, and find practices with evidence from similar environments.
- The PROMISE project
  - An online repository of public-domain data. At the time of this writing, the repository has 91 data sets. Half refer to defect prediction, and the remainder explore effort prediction, model-based SE, text mining of SE data, and other issues.
  - An annual conference where authors are strongly encouraged not only to publish papers, but also to contribute to the repository the data they used to make their conclusions.
  - Special journal issues to publish the best papers from the conference [Menzies 2008].
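The idea mentioned above of tagging data with context metadata, so that end users can describe their own environment and find evidence from similar ones, can be illustrated with a small sketch. This is not the design of any of the repositories listed here. The class names, the three context variables (taken from the examples of project size, criticality, and domain), the sample entries, and the simple "count the shared variables" matching rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical context metadata attached to a piece of evidence."""
    size: str         # e.g. "small", "large"
    criticality: str  # e.g. "low", "high"
    domain: str       # e.g. "avionics", "web"


@dataclass(frozen=True)
class Evidence:
    """A practice together with the context its evidence was drawn from."""
    practice: str
    context: Context


def similarity(a: Context, b: Context) -> int:
    """Count how many context variables two environments share."""
    return sum(
        getattr(a, name) == getattr(b, name)
        for name in ("size", "criticality", "domain")
    )


def find_similar(repo, query, min_shared=2):
    """Return evidence sharing at least min_shared context variables
    with the query, most similar first."""
    scored = [(similarity(e.context, query), e) for e in repo]
    matches = [(s, e) for s, e in scored if s >= min_shared]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in matches]


# Invented sample entries, not real repository contents.
REPO = [
    Evidence("code inspections", Context("large", "high", "avionics")),
    Evidence("pair programming", Context("small", "low", "web")),
    Evidence("formal test plans", Context("large", "high", "medical")),
]

query = Context("large", "high", "avionics")
for item in find_similar(REPO, query):
    print(item.practice)
```

A real repository would need richer metadata and a more careful notion of similarity; the point of the sketch is only that explicit context tags make "evidence from environments like mine" a question that can be asked mechanically.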
Repositories are especially useful when they provide a link back to the contributors of the data. We need to recognize that data can seldom stand alone; authors should expect, as a norm rather than an anomaly, to answer questions and engage in a dialogue with those trying to interpret their work. Such contacts are very important for learning the context from which the data was drawn, and can help users navigate to the data that is most relevant to their question or their own context. For example, Gunes Koru at the University of Maryland, Baltimore County (UMBC) found a systematic error in some of the PROMISE data sets. Using the blogging software associated with the PROMISE repository, he was able to publicize the error. Further, through the PROMISE blog, he contacted the original data generators, who offered extensive comments on the source of those errors and how to avoid them.