Safety Ring: Fault-tolerant Distributed Process Execution in OSIRIS

Nenad Stojnić and Heiko Schuldt
Technical Report
Appears in: Technical Report CS-2012-002, Department of Mathematics and Computer Science, University of Basel
The advent of service-oriented architectures (SOAs) has strongly facilitated the development and deployment of large-scale distributed (service-oriented) applications. The middleware that orchestrates process-based applications composed of several distributed services has to be inherently distributed as well, in order to provide a high degree of scalability and to avoid a single point of failure. Self-healing execution of such processes on a distributed middleware requires that both the control metadata and the instance data of processes be replicated. Most importantly, replication has to be provided in a way that does not impair the adaptivity and elasticity of the middleware for composite service execution. In this technical report, we introduce the OSIRIS Safety Ring, a novel approach to fault-tolerant process execution. The Safety Ring is based on OSIRIS, a distributed and decentralized middleware for the execution of composite services. Essentially, the Safety Ring exploits dedicated node monitors, organized in a self-organizing ring structure, for the replication of control data. Moreover, it leverages virtual stable storage to manage process instance data in a robust way. We present the architecture of the OSIRIS Safety Ring and discuss in detail the algorithms it applies for self-healing process execution. The performance evaluation shows that the additional gain in robustness has only a marginal effect on the scalability characteristics of the system.