Dissertation/Thesis Abstract

Addressing production run failures dynamically
by Tucek, Joseph A., Ph.D., University of Illinois at Urbana-Champaign, 2011, 163; 3503707
Abstract (Summary)

The high complexity of modern software, and our pervasive reliance on that software, have made the problems of software reliability increasingly important. Yet despite advances in software engineering practice, pre-release testing, and automated analysis, reports of high-profile production failures are still common. This dissertation proposes several run-time techniques to analyze and alleviate software failures dynamically, during production runs.

The first technique is low overhead checkpoint, rollback, and re-execution. By providing a window of time in which a period of execution can be relived, low overhead checkpointing allows expensive analytical steps to be deferred until they are needed. The second technique is a collection of dynamically insertable run-time analysis tools, which can use information gleaned over multiple analytical runs of the same execution to incrementally build a picture of a production run failure that is more complete than any individual analysis could provide. Finally, based on my experience with the behavior of programs under failure, and the underlying causes of those failures, this dissertation introduces the concept of delta execution and provides a run time that supports it. Delta execution (or Δ execution) is the process of running more than one instance or version of a program while sharing the majority of issued instructions and state. This dissertation uses Δ execution specifically to validate software patches at production run time.
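As a rough illustration of the checkpoint, rollback, and re-execution idea, the following Python sketch runs a toy program cheaply and only turns on tracing after a failure, by rolling back to the last checkpoint and replaying. Everything here (the CheckpointedRun class, the append_item step, the dict-based state) is a hypothetical stand-in chosen for illustration; the dissertation's systems checkpoint real process state, not Python dictionaries.

```python
import copy


class CheckpointedRun:
    """Toy model: program state is a dict, each step is a function that mutates it."""

    def __init__(self, state):
        self.state = state
        self._snapshot = copy.deepcopy(state)
        self._log = []  # steps issued since the last checkpoint

    def checkpoint(self):
        # Hypothetical: a real system would snapshot process state here.
        self._snapshot = copy.deepcopy(self.state)
        self._log.clear()

    def step(self, func, arg):
        # Log before executing so that a failing step can still be replayed.
        self._log.append((func, arg))
        func(self.state, arg)

    def replay(self, observer):
        """Roll back to the last checkpoint and re-run the logged steps,
        calling observer(state, step_name, arg) before each one.
        Returns the exception that ended the replay, or None."""
        self.state = copy.deepcopy(self._snapshot)
        for func, arg in list(self._log):
            observer(self.state, func.__name__, arg)
            try:
                func(self.state, arg)
            except Exception as exc:
                return exc
        return None


def append_item(state, item):
    if item < 0:
        raise ValueError("negative item")
    state["items"].append(item)


def demo():
    run = CheckpointedRun({"items": []})
    run.step(append_item, 1)
    run.checkpoint()
    trace = []
    try:
        run.step(append_item, 2)
        run.step(append_item, -1)  # stands in for the production failure
    except ValueError:
        # Only now pay for the expensive analysis: replay with tracing on.
        run.replay(lambda s, name, arg: trace.append((name, arg, list(s["items"]))))
    return trace
```

Calling demo() returns a trace of the two steps issued after the checkpoint, each paired with the state seen just before it ran. The failure-free path never invokes the observer, which is the point of deferring analysis.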

These three techniques have been demonstrated in three implemented systems supporting various end-level reliability goals. The first system, called Sweeper, is a run-time defensive system against security bugs. The Sweeper system imposes only 1% overhead in ordinary operation, and can generate an effective protective measure in only 60 milliseconds. According to an analytic model, this is sufficient to limit the spread of a fast worm to only 5% of the susceptible hosts, even for a worm that spreads 10,000 times faster than any previously observed in the wild.

The second system is called Triage. Rather than improving reliability by improving security, Triage attempts to enable the improvement of the underlying code by automating failure diagnosis of production run systems. Triage performs failure diagnosis post hoc at the end user's site. Low overhead checkpointing allows the capture of a failing execution, so expensive analysis can be deferred until it is definitely needed. Repeated replay allows the incremental application of a variety of failure analysis techniques, similar to the process a human programmer might undertake. For analyses that generally take direction from a human, Triage substitutes the results of previous analytical steps. Overall, Triage imposes only 5% overhead in failure-free execution, and, if a failure occurs, all of the analysis that requires re-execution is complete within about 5 minutes. In a study with human programmers, the output of Triage analysis reduced the time to patch real software faults by 45%.
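The idea of feeding one analysis into the next across repeated replays can be sketched in Python as follows. This is a toy under stated assumptions, not Triage's actual analyses: the recorded execution is a hypothetical list of (target, sources, function) operations, the first replay only locates the failing operation, and the second replay uses that result, in place of a human's choice of starting point, to compute the operations the failure depends on. All names (replay, locate_failure, slice_from_failure, diagnose, OPS) are invented for this example.

```python
def replay(ops, hook=None):
    """Re-execute a recorded run; return (index, exception) at the first failure."""
    env = {}
    for i, (target, sources, fn) in enumerate(ops):
        if hook:
            hook(i, target, sources)
        try:
            env[target] = fn(*(env[s] for s in sources))
        except Exception as exc:
            return i, exc
    return None, None


def locate_failure(ops):
    # First replay: a cheap pass that only finds where the run fails.
    index, exc = replay(ops)
    return {"failing_op": index, "error": type(exc).__name__}


def slice_from_failure(ops, findings):
    # Second replay: track which earlier operations each operation depends on,
    # then report the dependencies of the failing operation found previously.
    deps = {}         # op index -> indices it transitively depends on
    last_writer = {}  # variable name -> index of the op that last wrote it

    def hook(i, target, sources):
        d = set()
        for s in sources:
            w = last_writer[s]
            d |= {w} | deps[w]
        deps[i] = d
        last_writer[target] = i

    replay(ops, hook)
    return {**findings, "slice": sorted(deps[findings["failing_op"]])}


def diagnose(ops):
    findings = locate_failure(ops)
    return slice_from_failure(ops, findings)


# Hypothetical recorded execution: op 4 divides by the zero written by op 1.
OPS = [
    ("a", (), lambda: 4),
    ("b", (), lambda: 0),
    ("e", (), lambda: 7),            # unrelated to the failure
    ("c", ("a",), lambda a: a + 1),
    ("d", ("c", "b"), lambda c, b: c // b),
]
```

For this recorded run, diagnose(OPS) reports operation 4 as the failure and operations 0, 1, and 3 as its dependencies, leaving out the unrelated operation 2.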

The third system presented in this dissertation deals with the problems introduced when programmers make changes. This dissertation proposes Δ execution. If the executions (in terms of instruction streams and data) of the patched and unpatched versions of a program are mostly identical, then it is possible to run both versions mostly in one instruction stream. Only rarely, when the executions do differ, is it necessary to run two sets of instructions. By running only the differing, or delta, segments separately, Δ execution allows low overhead production run patch validation which is 12% faster than side-by-side patch validation. Further (and perhaps more important), many of the effects which make patch validation difficult (multithreading, timing sensitivity, and system level nondeterminism) are nullified, as they affect the two logical executions inside the one physical execution identically. This dissertation shows that, of ten applications tested, Δ execution can validate all of the patches, while traditional side-by-side validation only manages to validate two. (Abstract shortened by UMI.)
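The split-and-merge behavior described above can be illustrated with a small Python sketch. It is a deliberately simplified model under stated assumptions: each program version is a hypothetical list of step functions over a dict, steps that are the same function object are treated as shared, and the two states are merged again whenever they compare equal. The real run time operates on instructions and process state; the names delta_execute, load, clamp_old, clamp_new, and finish are invented here.

```python
import copy


def delta_execute(old_steps, new_steps, state):
    """Run two versions of a step list, executing shared steps only once.

    Returns (old_state, new_state, executed), where executed counts
    how many step invocations were actually performed."""
    shared, old, new = state, None, None  # old/new are set only while split
    executed = 0
    for f_old, f_new in zip(old_steps, new_steps):
        if old is None and f_old is f_new:
            # Identical step on identical state: run it once for both versions.
            f_old(shared)
            executed += 1
            continue
        if old is None:
            # The versions differ here: fork the state into two copies.
            old, new = copy.deepcopy(shared), copy.deepcopy(shared)
        f_old(old)
        f_new(new)
        executed += 2
        if old == new:
            # The delta had no lasting effect: merge back into one state.
            shared, old, new = old, None, None
    if old is None:
        return shared, shared, executed
    return old, new, executed


def load(s):
    s["y"] = s["x"] * 2


def clamp_old(s):
    s["y"] = min(s["y"], 10)


def clamp_new(s):
    # Hypothetical patch: also clamp negative values to zero.
    s["y"] = max(0, min(s["y"], 10))


def finish(s):
    s["out"] = s["y"] + 1


OLD_STEPS = [load, clamp_old, finish]
NEW_STEPS = [load, clamp_new, finish]
```

With input {"x": 3} the patched step changes nothing, so the two versions merge and only 4 step invocations run instead of the 6 a side-by-side run would need. With input {"x": -3} the versions genuinely diverge, and the sketch keeps them split through the final step.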

Indexing (document details)
Advisor: Zhou, Yuanyuan
School: University of Illinois at Urbana-Champaign
Department: Computer Science
School Location: United States -- Illinois
Source: DAI-B 73/08(E), Dissertation Abstracts International
Subjects: Computer science
Keywords: Debugging, Delta execution, Flash worm, Operating systems, Production run, Re-execution, Software reliability
Publication Number: 3503707
ISBN: 978-1-267-27409-0