TY - GEN
T1 - A closer look at fault tolerance
AU - Taubenfeld, Gadi
N1 - Copyright:
Copyright 2012 Elsevier B.V., All rights reserved.
PY - 2012
Y1 - 2012
N2 - The traditional notion of fault tolerance requires that all the correct participating processes eventually terminate, and thus, is not sensitive to the number of correct processes that should properly terminate as a result of failures. Intuitively, an algorithm that in the presence of any number of faults always guarantees that all the correct processes except maybe one properly terminate, is more resilient to faults than an algorithm that in the presence of a single fault does not even guarantee that a single correct process ever terminates. However, according to the standard notion of fault tolerance both algorithms are classified as algorithms that can not tolerate a single fault. To overcome this difficulty, we generalize the traditional notion of fault tolerance in a way which enables to capture more sensitive information about the resiliency of an algorithm. Then, we present several algorithms for solving classical problems which are resilient under the new notion. It is well known that, in an asynchronous systems where processes communicate either by reading and writing atomic registers or by sending and receiving messages, important problems such as, consensus, set-consensus, election, perfect renaming, implementations of a test-and-set bit, a shared stack, a swap object and a fetch-and-add object have no deterministic solutions which can tolerate even a single fault. We show that while, some of these problems have solutions which guarantee that in the presence of any number of faults most of the correct processes will properly terminate; other problems do not even have solutions which guarantee that in the presence of just one fault at least one correct process properly terminates.
AB - The traditional notion of fault tolerance requires that all the correct participating processes eventually terminate, and thus, is not sensitive to the number of correct processes that should properly terminate as a result of failures. Intuitively, an algorithm that in the presence of any number of faults always guarantees that all the correct processes except maybe one properly terminate, is more resilient to faults than an algorithm that in the presence of a single fault does not even guarantee that a single correct process ever terminates. However, according to the standard notion of fault tolerance both algorithms are classified as algorithms that can not tolerate a single fault. To overcome this difficulty, we generalize the traditional notion of fault tolerance in a way which enables to capture more sensitive information about the resiliency of an algorithm. Then, we present several algorithms for solving classical problems which are resilient under the new notion. It is well known that, in an asynchronous systems where processes communicate either by reading and writing atomic registers or by sending and receiving messages, important problems such as, consensus, set-consensus, election, perfect renaming, implementations of a test-and-set bit, a shared stack, a swap object and a fetch-and-add object have no deterministic solutions which can tolerate even a single fault. We show that while, some of these problems have solutions which guarantee that in the presence of any number of faults most of the correct processes will properly terminate; other problems do not even have solutions which guarantee that in the presence of just one fault at least one correct process properly terminates.
KW - consensus
KW - election
KW - estack
KW - fault tolerance
KW - fetch-and-add
KW - message passing
KW - renaming
KW - set-consensus
KW - shared memory
KW - swap
KW - test-and-set swap, fetch-and-add.
UR - http://www.scopus.com/inward/record.url?scp=84865005080&partnerID=8YFLogxK
U2 - 10.1145/2332432.2332484
DO - 10.1145/2332432.2332484
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:84865005080
SN - 9781450314503
T3 - Proceedings of the Annual ACM Symposium on Principles of Distributed Computing
SP - 261
EP - 270
BT - PODC'12 - Proceedings of the 2012 ACM Symposium on Principles of Distributed Computing
T2 - 2012 ACM Symposium on Principles of Distributed Computing, PODC'12
Y2 - 16 July 2012 through 18 July 2012
ER -