Grants and Contracts Details
Description
Reproducibility of results represents a significant challenge facing software engineering research today. Advancing
the state of the art in research areas driven primarily by empirical studies, such as maintenance, traceability, and
testing, requires researchers to not only propose new, more efficient, and effective approaches that address identified
problems, but also to compare these approaches to existing ones in order to demonstrate that they are either
complementary or superior in clearly defined ways. Unfortunately, this process can be time-consuming and error-prone.
Existing approaches may be hard to reproduce for many reasons. Previously used datasets may be
unavailable; tools may be costly, proprietary, or may have become obsolete; and finally, implementation details such
as parameter values or environment factors may be missing from the original papers.
Recently, Dit et al. undertook a survey of feature location (FL) techniques, which revealed that only 5% of the
papers surveyed (three out of 60) used the same datasets previously used by other researchers to evaluate their
techniques, and that only 38% of the papers surveyed (23 out of 60) compared the proposed feature location
technique against even a small number of existing approaches. These findings are consistent with those of the study
by Robles, which determined that, among the 154 research papers analyzed, only two made their datasets and
implementation available, and that the vast majority described evaluations that could not be reproduced due to a
lack of data, details, and tools.
Similarly, Shin et al. undertook a systematic literature review to determine which measures were used to evaluate
different tracing techniques. They found nine different metrics commonly used for evaluation purposes
(recall/precision, accuracy, F-measure, average precision, Change, Lag, DiffAR, DiffMR, and raw values) and
showed that inconsistencies in the way these measures were computed meant that results were often not
comparable. In an earlier study, Hayes and Dekhtyar found similar problems, such as poorly described experiments,
a lack of publicly available datasets, and insufficient justification of the metrics used, all of which led to a lack
of repeatability.
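To make the comparability problem concrete, the sketch below (a hypothetical illustration, not drawn from the Shin et al. review) computes precision, recall, and F-measure for a toy trace-recovery result in two common ways: pooling all links across queries (micro-averaging) and averaging per-query scores (macro-averaging). The same tool output yields different numbers under the two conventions, which is one way unreported computation details can make published results hard to compare.

```python
from typing import Dict, Set


def precision_recall_f1(retrieved: set, relevant: set):
    """Standard precision, recall, and F-measure for one set of links."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_average(results: Dict[str, Set[str]], gold: Dict[str, Set[str]]):
    """Pool all (query, target) links first, then compute the metrics once."""
    retrieved = {(q, t) for q, ts in results.items() for t in ts}
    relevant = {(q, t) for q, ts in gold.items() for t in ts}
    return precision_recall_f1(retrieved, relevant)


def macro_average(results: Dict[str, Set[str]], gold: Dict[str, Set[str]]):
    """Compute the metrics per query, then average the per-query scores."""
    scores = [precision_recall_f1(results.get(q, set()), gold[q]) for q in gold]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical requirement-to-code trace links (gold standard vs. tool output).
gold = {"REQ1": {"A.java", "B.java"}, "REQ2": {"C.java"}}
results = {"REQ1": {"A.java"}, "REQ2": {"C.java", "D.java", "E.java"}}

print("micro:", micro_average(results, gold))  # roughly (0.50, 0.67, 0.57)
print("macro:", macro_average(results, gold))  # roughly (0.67, 0.75, 0.58)
```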
A study by Gonzalez-Barahona and Robles identified several factors affecting the reproducibility of results in
empirical software engineering research and proposed a methodology for determining the reproducibility of a study.
In another study, Mytkowicz et al. investigated the influence of the omitted-variable bias (i.e., a bias in the results of
an experiment caused by omitting important causal factors from the design) in compiler optimization evaluation.
Their study showed that factors such as the environment size and the link order, which are often neither reported
nor properly explained in research papers, are common, unpredictable, and can significantly influence the
results. Moreover, D'Ambros et al. argued that many approaches to bug prediction have not been evaluated
properly because they were either evaluated in isolation from other techniques or compared against a limited
set of other approaches.
Such problems are pervasive across a far broader range of scientific domains than software engineering alone. For
example, a recent article that appeared in both the Economist and the Los Angeles Times cited a study in which a
biotech firm (Amgen) decided to double-check the results of 53 previously published landmark papers that were
fundamental to its multi-million-dollar development plans. Shockingly, the firm was able to reproduce only six of
the studies. While this does not mean that the other studies were fraudulent, it does
mean that there was insufficient information to make them reproducible by others.
This issue of the reproducibility of experiments and approaches has been discussed and investigated in different
areas of empirical software engineering research and some initial steps have been taken towards solving this
problem. For example, efforts to establish datasets or benchmarks that can be used uniformly in evaluations
have resulted in online benchmark repositories such as PROMISE, Eclipse Bug Data, the SEMERU feature location
dataset, the Bug Prediction Dataset, SIR, and others. In addition, different infrastructures for running experiments
have been introduced, such as TraceLab, RapidMiner, Simulink, Kepler, and others. Of these, TraceLab is a plug-and-play
framework that was specifically designed for facilitating the creation, evaluation, comparison, and sharing of
experiments in software engineering, thereby making them easily reproducible. In Section we expand on this
discussion of why TraceLab is highly suitable for facilitating and advancing software engineering research.
| Status | Finished |
| --- | --- |
| Effective start/end date | 6/1/15 → 5/31/19 |
Projects (1 Finished)
- REU: Supplemental Funding Request CI-EN: RUI: Collaborative Research: TraceLab Community Infrastructure for Replication, Collaboration, and Innovation
Hayes, J.
6/28/16 → 5/31/18
Project: Research project