Handling Timing Errors in Distributed Programs

Aaron J. Gordon, Raphael A. Finkel

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

In a distributed environment, events occur concurrently on different processors. The order in which events occur cannot be easily determined; a program that works correctly one time may fail subsequently if the timing between processors changes. For this research, we have investigated distributed program bugs that depend on the relative order between events. We describe a tool (called TAP) to aid the programmer in discovering the causes of timing errors in running programs. TAP, a tool similar to a postmortem debugger, uses the history of interprocess communication to construct a timing graph, a directed graph where an edge joins node x to node y if event x directly precedes event y in time. The programmer can then use TAP to look at the graph to find the events that occurred in an unacceptable order. Because of the nondeterministic nature of distributed programs, we feel a history-keeping mechanism must always be active so that bugs can be dealt with as they occur. Our goal is to collect enough information at run time to construct the timing graph if needed. Since it is always active, this mechanism must be efficient. We also describe experiments run using TAP and report the impact that TAP's history-keeping mechanism has on the running time of various distributed programs.
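
The timing graph described in the abstract can be illustrated with a minimal sketch. This is not TAP's implementation, which the abstract does not detail. It assumes a hypothetical history format: each process records its events in local order, and each message is recorded as a (send event, receive event) pair. The function names build_timing_graph, precedes, and unordered are invented for this illustration. An edge x -> y is added when x directly precedes y, either on the same process or through a message.

```python
from collections import defaultdict, deque


def build_timing_graph(process_logs, messages):
    """Build a directed graph of 'directly precedes' edges.

    process_logs: dict of process name -> list of event ids in local order
                  (hypothetical history format, assumed for this sketch).
    messages:     list of (send_event, receive_event) pairs.
    """
    graph = defaultdict(set)
    for events in process_logs.values():
        # Consecutive events on one process are ordered by that process.
        for earlier, later in zip(events, events[1:]):
            graph[earlier].add(later)
    for send_event, receive_event in messages:
        # A message send directly precedes its matching receive.
        graph[send_event].add(receive_event)
    return graph


def precedes(graph, x, y):
    """Return True if a directed path leads from event x to event y."""
    seen = {x}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor == y:
                return True
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


def unordered(graph, x, y):
    """Return True if the recorded history orders x and y in neither direction."""
    return not precedes(graph, x, y) and not precedes(graph, y, x)


# Two processes; P1 sends a message at a1 that P2 receives at b2.
logs = {"P1": ["a1", "a2"], "P2": ["b1", "b2"]}
graph = build_timing_graph(logs, [("a1", "b2")])
print(precedes(graph, "a1", "b2"))   # ordered by the message
print(unordered(graph, "a2", "b2"))  # no path either way
```

Pairs of events for which unordered returns True are the ones whose relative order may differ from run to run, so under these assumptions they are natural places to look for a timing error.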

Original language: English
Pages (from-to): 1525-1535
Number of pages: 11
Journal: IEEE Transactions on Software Engineering
Volume: 14
Issue number: 10
State: Published - Oct 1988

Bibliographical note

Funding Information:
Manuscript received September 8, 1986; revised June 2, 1987. This work was supported in part by the National Science Foundation under Grant MCS-8105904 and by the Defense Advanced Research Projects Agency under Grant N0014-82-C-2087. A. J. Gordon is with the Department of Mathematical and Computer Sciences, Colorado School of Mines, Golden, CO 80401. R. A. Finkel is with the Department of Computer Science, University of Kentucky, Lexington, KY 40506. IEEE Log Number 8823072.

Funding


Funder                                       Funder number
National Science Foundation (NSF)            MCS-8105904
Defense Advanced Research Projects Agency    N0014-82-C-2087

Keywords

• Debugging
• Distributed programming
• Timing errors

ASJC Scopus subject areas

• Software
