Dynamic Data Replication in the Grid with Freshness and Correctness Guarantees
This thesis explores architectural issues and performance aspects of data Grid infrastructures. The objective is to develop a scalable infrastructure that is capable of dynamically managing replicated data in the Grid while at the same time providing freshness and correctness guarantees. We propose a decentralized middleware which can be deployed on top of any Grid (or any distributed and heterogeneous infrastructure). The difficulty is to ensure that such an infrastructure can offer scalability, performance and correctness. The overall goal of this thesis is to present a replication mechanism that combines scalability, global correctness and quality of service guarantees in a dynamic way. We begin by introducing important aspects of Grid environments and several scenarios from newly emerging eScience applications. These use case scenarios urgently require new integrated approaches to dynamic replication in a data Grid. Our main contribution is the Re:GRIDiT protocol family, which dynamically manages replicas in the Grid while at the same time providing freshness and correctness guarantees. The Re:GRIDiT family consists of three different protocols which target the three main problematic aspects identified in current data Grid infrastructures. Inspired by the requirements deduced from these scenarios, we concentrate our efforts on the more complex and general case of distributed update transactions on replicated data. We devise a protocol, Re:SYNCiT, for the correct synchronization of concurrent updates to different updateable replicas and their subsequent propagation to read-only replicas in a completely distributed way. Re:SYNCiT hides the presence of replicas from the applications, takes into account the special characteristics of data in the Grid, such as version support and the distinction between mutable and immutable objects, and provides provably correct transactional execution guarantees without any global component.
The next step is the Re:LOADiT approach to dynamic distributed replica management in data Grid systems. We propose efficient algorithms for selecting optimal locations for placing the replicas so that the load among these replicas is balanced. Given the data usage from each user site and the maximum load of each replica, our algorithm efficiently manages the number of replicas required, reducing or increasing it on demand. Up to this point, our approach dictates how update sites behave, and from a user’s point of view clients always access the most up-to-date data. We further refine this approach and introduce the Re:FRESHiT protocol, which allows freshness to be traded effectively for performance and addresses freshness and versioning issues, needed in many Grid application domains, without losing consistency. Queries with different freshness levels are routed according to our tree strategy, which takes advantage of the tree structure. Finally, we are also interested in the performance characteristics of the presented algorithms. We have implemented the Re:GRIDiT protocols using state-of-the-art Web service technologies, which allow an easy and seamless deployment in any Grid environment. The evaluation has been conducted on up to 48 update sites and 48 read-only sites. We have used simulated workloads that mimic the behavior expected from our use case applications. Our evaluations have shown that the proposed Re:GRIDiT protocols are efficient, as replicas are created and/or deleted on demand and with a reasonable amount of resources. Dynamic changes in the tree structure allow flexible and efficient query routing along the tree. The routing strategies ensure an increased performance for queries with different freshness levels. Re:GRIDiT ensures replica consistency and is capable of providing different degrees of consistency and update frequencies.
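To illustrate the idea of routing queries with different freshness levels along a tree of replicas, the following is a minimal sketch only. It is not the Re:FRESHiT algorithm, whose routing rule is not given here. It assumes a hypothetical tree in which the root is an update site with freshness 1.0, each read-only site is at most as fresh as its parent, and a query is served by the deepest site that still meets its required freshness, so that the update site is used only when nothing below it is fresh enough. The names ReplicaNode and route_query are invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReplicaNode:
    """Hypothetical replica site in a propagation tree (illustrative only)."""
    name: str
    # Assumed freshness score in [0, 1]; 1.0 means fully up to date.
    # Assumption: a child is never fresher than its parent.
    freshness: float
    children: List["ReplicaNode"] = field(default_factory=list)


def route_query(node: ReplicaNode, required: float) -> Optional[ReplicaNode]:
    """Return the deepest node, in depth-first order, that satisfies `required`.

    Returns None if even `node` is not fresh enough.
    """
    if node.freshness < required:
        # Under the monotonicity assumption the whole subtree is too stale.
        return None
    for child in node.children:
        hit = route_query(child, required)
        if hit is not None:
            return hit
    # No child qualifies, so this node serves the query itself.
    return node


# Example tree: one update site with read-only sites below it.
leaf = ReplicaNode("read-only-C", 0.5)
root = ReplicaNode("update-site", 1.0, [
    ReplicaNode("read-only-A", 0.8, [leaf]),
    ReplicaNode("read-only-B", 0.9),
])
```

In this sketch a query that tolerates stale data is pushed towards the leaves, while a query that demands fresh data falls back to the update site. A real protocol would also have to account for load and for dynamic changes in the tree, which the sketch leaves out.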
In summary, this thesis presents new approaches for the correct and dynamic synchronization of updates, for replication management, and for freshness guarantees in a data Grid. These approaches are founded on a formal theoretical background and implemented in a full-fledged prototype in a real Grid environment. They have been shown to be scalable by means of an extensive analytical and experimental evaluation.