Self-Organizing Distributed Workflow Management

Authors
Nenad Stojnić
Type
PhD Thesis
Date
2015/5
Appears in
PhD Thesis, Department of Mathematics and Computer Science
Location
University of Basel, Switzerland
Abstract
The proliferation of service-oriented architectures in the last decade has brought forward an important class of sophisticated distributed applications that are founded on the idea of composing multiple simple services into a complex, coherent whole. Such applications spanning multiple service invocations can be most effectively realized by means of workflows. When it comes to high performance workflow execution, distribution (outscaling) of services is a key concept and also a very straightforward advantage of the workflow paradigm. Concretely, both the constituent services of the workflow and the system that manages their invocations have to be distributed across an environment of computational devices. In a wide spectrum of applications, that entail heterogeneity of the encompassed computational devices, e.g., modern emergency management, invocations of optimal service instances in conjunction to their reliability are fundamental prerequisites of distributed workflow management. At the center of this thesis is a formal model that defines the distributed (i.e., scalable) execution of workflows. To extend this model for reliability in a novel way, which does not affect the scalability of execution, the Safety-Ring system service is presented. The idea behind Safety-Ring is to offer recovery for a wide range of node failures, which host active services of running workflows. To this end, the Safety-Ring provides a scalable, reliable, and consistent data store that is used for the storage of workflow execution state. The novel failure-recovery mechanism features effective reliability such that can be applied on the nodes that host the Safety-Ring service themselves, thus we say the Safety-Ring is self-healing. To apply the reliable (and distributed) execution model, enhanced by Safety-Ring, to heterogeneous node environments, that are predominantly composed of mobile devices, this thesis presents the Compass data access protocol. In providing scalable data lookup for its maintained data, the Safety-Ring assumes network runtime characteristics which are rather stable, and thus Safety-Ring implicitly optimizes for the number of queried nodes. Especially in mobile applications, where node network connectivity dynamically changes, data access protocols should aim at reducing the overall data lookup latency, rather than the number of queried nodes. Compass introduces latency optimal paths to each node, which dynamically adapt to changing network characteristics. The scalable data lookup of Safety-Ring is not affected. In case distributed execution of workflows spans services of continuous (stateful) type, their reliability is decoupled from the Safety-Ring. Since such service types are predominantly featured by devices of limited resources, novel approaches to resource conservative recovery of failures for continuous services have to be provided. This thesis builds on proven recovery techniques, such as passive-standby, so as to enhance them for redundancy of the continuous state and thus improve the overall reliability of the system. In doing so the redundancy of state is enforced by means of a lightweight, in terms of network overhead, consistency protocol which allows for its application in resource limited node environments. In order to improve the execution performance of distributed workflows, in terms of throughput, this thesis offers a novel concept to services distribution. At the heart of our approach lie decentralized controllers that autonomously perform dynamic reconfiguration of the execution environment in terms of available services. This primarily affects the Safety-Ring service and all application services. Thereby, the goals of the controllers are to prevent workflow execution bottlenecks and unnecessary service deployments that waste resources. Since the controllers are equipped at any node in the system and can affect any other node of the system we say that the distributed workflow execution model is self-optimizing. Finally, all the presented concepts are implemented within the context of the OSIRIS distributed workflow engine and quantitatively evaluated in a series of experiments. The results of experiments confirm the benefits of our concepts for the distributed workflow execution model.
Staff members