The availability of large-scale online services (also known as cloud services) is becoming more important as more private and public infrastructure systems come to depend on them. The time taken to detect that a failure has occurred and the time taken to determine where in the system the fault lies are two important factors in the length of downtime a service experiences. Determining where (or what) has failed in a complex system also drives the correct mitigation of the fault. Quickly and correctly identifying a fault in a complex system reduces the downtime the system experiences, thus improving its availability. The purpose of this research was to compare the performance of a system (named Nova) that used ad hoc Bayesian networks to detect failures and locate faults in a complex, global-scale, software-as-a-service cloud offering with that of the existing detection and scoping system (named Senex) used by this service. The comparison was done to determine whether on-demand, ad hoc Bayesian networks were faster than Senex at detecting failures and locating faults in the system while maintaining reasonably accurate predictions of the fault scope. Actual service failure data were used for this study. The times that the Senex system took to detect each failure and to scope the fault were recorded, and the actual failing component was identified and recorded. A simulation was then conducted using the Nova system. The Nova system was passed actual probe data from the time of the failures, and the time it took to detect the failure and identify the failing component was recorded along with the identity of the predicted failed components. It was found that the Nova system was significantly faster at detecting failures and identifying faults and that it was acceptably accurate in its predictions.
Further study is needed to create more complex networks that can identify a larger set of issues than the Nova system was able to detect. Other areas of interest include identifying the owner of issues using Bayesian networks and incorporating cost into the analysis.
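As a rough illustration of the kind of inference the abstract describes, the sketch below computes a posterior over candidate faults from probe results using Bayes' rule. It is not the Nova implementation: the function name `fault_posterior`, the component and probe names, and all probabilities are hypothetical, and it assumes at most one fault at a time and probes that are conditionally independent given the fault.

```python
def fault_posterior(priors, probe_fail_prob, observations):
    """Return P(hypothesis | probe results) for each hypothesis.

    priors: hypothesis -> prior probability (hypothetical values).
    probe_fail_prob: probe -> {hypothesis -> P(probe fails | hypothesis)}.
    observations: probe -> True if the probe failed, False if it passed.
    Assumes a single fault and conditionally independent probes.
    """
    unnormalized = {}
    for hypothesis, prior in priors.items():
        likelihood = 1.0
        for probe, failed in observations.items():
            p_fail = probe_fail_prob[probe][hypothesis]
            likelihood *= p_fail if failed else 1.0 - p_fail
        unnormalized[hypothesis] = prior * likelihood

    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observations are impossible under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical service: "healthy" means no fault is present.
priors = {"healthy": 0.98, "frontend": 0.01, "database": 0.01}
probe_fail_prob = {
    "login_probe": {"healthy": 0.01, "frontend": 0.90, "database": 0.90},
    "query_probe": {"healthy": 0.01, "frontend": 0.05, "database": 0.95},
}
observations = {"login_probe": True, "query_probe": True}

posterior = fault_posterior(priors, probe_fail_prob, observations)
failure_detected = posterior["healthy"] < 0.5      # detection step
most_likely = max(posterior, key=posterior.get)    # fault-location step
print(failure_detected, most_likely)
```

In this toy setup, detection corresponds to the posterior for "healthy" dropping below a chosen threshold, and fault location corresponds to picking the most probable remaining hypothesis. A full Bayesian network would also model dependencies between components, which this sketch leaves out.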
|Committee:||Blackburn, Timothy, Malalla, Ebrahim|
|School:||The George Washington University|
|School Location:||United States -- District of Columbia|
|Source:||DAI-B 81/4(E), Dissertation Abstracts International|
|Subjects:||Engineering, Computer science|
|Keywords:||Availability, Bayesian networks, Cloud computing, Failure propagation, Reliability|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved