CI-EN: Collaborative Research: TraceLab Community Infrastructure for Replication, Collaboration, and Innovation

  • Hayes, Jane (PI)

Grants and Contracts Details


Reproducibility of results represents a significant challenge facing software engineering research today. Advancing the state of the art in research areas driven primarily by empirical studies, such as maintenance, traceability, and testing, requires researchers not only to propose new, more efficient, and more effective approaches that address identified problems, but also to compare these approaches to existing ones in order to demonstrate that they are either complementary or superior in clearly defined ways. Unfortunately, this process can be time consuming and error-prone. Existing approaches may be hard to reproduce for many reasons: previously used datasets may be unavailable; tools may be costly, proprietary, or obsolete; and implementation details such as parameter values or environment factors may be missing from the original papers. Recently, Dit et al. undertook a survey of feature location (FL) techniques which revealed that only 5% of the papers surveyed (three out of 60) used the same datasets as those previously used by other researchers to evaluate their techniques, and that only 38% (23 out of 60) compared their proposed feature location technique against even a small number of existing approaches. These findings are consistent with those of the study by Robles, which determined that, among the 154 research papers analyzed, only two made their datasets and implementation available, and the vast majority described evaluations that could not be reproduced due to a lack of data, details, and tools. Similarly, Shin undertook a systematic literature review to determine what measures were used to evaluate different tracing techniques.
The review identified nine metrics commonly used for evaluation purposes, including recall/precision, accuracy, f-measure, average precision, Change, Lag, DiffAR, DiffMR, and Raw values, and showed that inconsistencies in the way these measures were computed meant that results were often not comparable. In an earlier study, Hayes and Dekhtyar found similar problems, such as poorly described experiments, a lack of publicly available datasets, and insufficient justification of the metrics used, all of which led to a lack of repeatability. A study by Gonzalez-Barahona and Robles identified several factors affecting the reproducibility of results in empirical software engineering research and proposed a methodology for determining the reproducibility of a study. In another study, Mytkowicz et al. investigated the influence of omitted-variable bias (i.e., a bias in the results of an experiment caused by omitting important causal factors from the design) in compiler optimization evaluation. Their study showed that factors such as environment size and link order, which are often unreported or poorly explained in research papers, are common, unpredictable, and can significantly influence results. Moreover, D'Ambros et al. argued that many approaches in bug prediction have not been evaluated properly because they were either evaluated in isolation from other techniques or compared against a limited set of other approaches. Such problems are pervasive across a far broader range of scientific domains than just software engineering. For example, a recent article that appeared in both the Economist and the Los Angeles Times cited a study in which a biotech firm (Amgen) decided to double-check the results of 53 previously published landmark papers that were fundamental to its multi-million dollar development plan. Shockingly, the firm was able to reproduce only six of the studies.
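The inconsistencies observed in that review often stem from ambiguity in how such measures are actually computed. As a point of reference, here is a minimal sketch of the standard definitions of precision, recall, and f-measure over candidate trace links; the link sets are purely illustrative and not drawn from any cited study.

```python
def precision_recall_f1(retrieved: set, relevant: set):
    """Return (precision, recall, f1) for a set of retrieved candidate
    links evaluated against the relevant (answer-set) links."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative example: links proposed by a tracing tool vs. the answer set.
candidate = {("req1", "code1"), ("req1", "code2"), ("req2", "code3")}
answer_set = {("req1", "code1"), ("req2", "code3"), ("req3", "code4")}
p, r, f = precision_recall_f1(candidate, answer_set)
# Two of three candidate links are correct, and two of three true links
# are found, so precision = recall = f-measure = 2/3 here.
```

Even for definitions this simple, published results diverge when papers differ on details such as whether ties are broken consistently or whether metrics are averaged per query or over all links, which is precisely the comparability problem noted above.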
While this does not mean that the other studies were fraudulent, it does mean that there was insufficient information to make them reproducible by others. The issue of reproducibility of experiments and approaches has been discussed and investigated in different areas of empirical software engineering research, and some initial steps have been taken toward solving this problem. For example, efforts to establish datasets or benchmarks that can be used uniformly in evaluations have resulted in online benchmark repositories such as PROMISE, Eclipse Bug Data, the SEMERU feature location dataset, the Bug Prediction Dataset, SIR, and others. In addition, different infrastructures for running experiments have been introduced, such as TraceLab, RapidMiner, Simulink, Kepler, and others. Of these, TraceLab is a plug-and-play framework that was specifically designed to facilitate the creation, evaluation, comparison, and sharing of experiments in software engineering, thereby making experiments easily reproducible. In Section we expand on this discussion of why TraceLab is highly suitable for facilitating and advancing software engineering research.
Effective start/end date: 6/1/15 – 5/31/19

