Stochastic Models for Fault Tolerance

Restart, Rejuvenation and Checkpointing

By Katinka Wolter

This book covers fault-tolerance methods based on redundancy in time, which must be triggered at the right moment in complex computing systems. It introduces the methods, gives their stochastic description, and covers aspects of their application in real-world systems.

  • ISBN13: 978-3-6421-1256-0
  • 288 Pages
  • User Level: Science
  • Publication Date: June 17, 2010
  • Available eBook Formats: PDF
  • eBook Price: $99.00
Full Description
Modern society relies on the fault-free operation of complex computing systems, so system fault tolerance has become an indispensable requirement. We therefore need mechanisms that guarantee correct service when system components fail, whether they are software or hardware elements. Redundancy patterns are commonly used for this purpose, either as redundancy in space or as redundancy in time. Wolter’s book details redundancy-in-time methods, which must be triggered at the right moment. In particular, she addresses the so-called 'timeout selection problem', i.e., the question of choosing the right time to apply fault-tolerance mechanisms such as restart, rejuvenation and checkpointing. Restart denotes a pure system restart, rejuvenation denotes a restart of the operating environment of a task, and checkpointing means saving the system state periodically and, when the system fails, reinitializing it from the most recent checkpoint. Her presentation includes a brief introduction to the methods, their detailed stochastic description, and aspects of their efficient implementation in real-world systems. The book is targeted at researchers and graduate students in system dependability, stochastic modeling and software reliability. Readers will find here an up-to-date overview of the key theoretical results, making this the only comprehensive text on stochastic models for restart-related problems.
Table of Contents

  • Part I: Introduction
      1. Basic Concepts and Problems
      2. Task Completion Time
  • Part II: Restart
      3. Applicability Analysis of Restart
      4. Moments of Completion Time under Restart
      5. Meeting Deadlines through Restart
  • Part III: Software Rejuvenation
      6. Practical Aspects of Preventive Maintenance and Software Rejuvenation
      7. Stochastic Models for Preventive Maintenance and Software Rejuvenation
  • Part IV: Checkpointing
      8. Checkpointing Systems
      9. Stochastic Models for Checkpointing
      10. Summary, Conclusion and Outlook
  • Appendix
      A. Properties in Discrete Systems
      B. Important Probability Distributions
      C. Estimating the Hazard Time
      D. The Laplace and the Laplace-Stieltjes Transform