Flexible, Policy-Based Data Archiving in the Cloud (PhD Thesis, ongoing)

Current distributed database management systems (DDBMS) in the Cloud offer high availability and reduced costs by lazily replicating data objects between geographically distributed datacenters. In practice, this means that different versions of the same data object coexist in the same distributed database, since stale versions are only eventually replaced by updates. Because storage prices have decreased significantly in recent years, it has become not only possible but also beneficial for Cloud providers to store all versions of a data object instead of only the most recent one. This enables providers to offer their clients advanced services such as time series analyses or restoring the database to a previous state.
When the system can guarantee that at least one copy of each version is kept, clients can retrieve stale data as well as the most recent data from the same Cloud database service. We call this "Archiving as a Service" (AaaS). AaaS not only increases the amount of versioned data, it also requires considerably more effort for updating, keeping, and reading these versions. At the same time, the AaaS approach offers a larger variety of query semantics. While "classic" Cloud databases only allow access to the most recent data, AaaS systems offer temporal freshness constraints such as "I require all data valid at time t", "I need all versions which are not older/younger than t", or "Return to me all versions between t1 and t2".
The central challenges in moving from lazy replication to AaaS are the following. First, the system must track which versions it contains at any time and ensure that at least one replica of each version is retained. Second, the system must decide where to place replicas and, at query time, select an appropriate replica that minimizes access costs.
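The freshness constraints quoted above can be made concrete with a small in-memory sketch. It is purely illustrative and not part of Tempo or ARCTIC: the names `Version`, `VersionStore` and all method names are hypothetical, timestamps are plain numbers, and "between t1 and t2" is assumed to mean versions created within that closed interval.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    """One version of a data object (hypothetical model)."""
    key: str
    created_at: float  # timestamp at which this version became valid
    value: str


class VersionStore:
    """Keeps every version of every key, sorted by creation time."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Version]] = {}

    def put(self, key: str, created_at: float, value: str) -> None:
        versions = self._versions.setdefault(key, [])
        versions.append(Version(key, created_at, value))
        versions.sort(key=lambda v: v.created_at)

    def valid_at(self, key: str, t: float) -> Optional[Version]:
        """Data valid at time t: the latest version created at or before t."""
        older = [v for v in self._versions.get(key, []) if v.created_at <= t]
        return older[-1] if older else None

    def not_older_than(self, key: str, t: float) -> List[Version]:
        """All versions created at time t or later."""
        return [v for v in self._versions.get(key, []) if v.created_at >= t]

    def not_younger_than(self, key: str, t: float) -> List[Version]:
        """All versions created at time t or earlier."""
        return [v for v in self._versions.get(key, []) if v.created_at <= t]

    def between(self, key: str, t1: float, t2: float) -> List[Version]:
        """All versions created in the closed interval [t1, t2]."""
        return [v for v in self._versions.get(key, []) if t1 <= v.created_at <= t2]
```

A "classic" Cloud database would only expose the equivalent of `valid_at(key, now)`; the other three methods are what retaining all versions makes possible.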
Since the system can answer queries with different replicas (which come in different degrees of freshness), it has a large potential for minimizing data management costs. Since everything in the Cloud comes with a price tag, it is straightforward to assign fine-grained costs to data access operations. There are, however, many costs that are not necessarily monetary but may still be subject to optimization, e.g., access latency, network/machine load, and storage space. Moreover, numerous constraints can apply to archived data, many of them stemming from organizational or legal policies. In most countries, for example, there are lower and upper bounds on the timespan during which certain data needs to be preserved. Other regulations state that specific data, e.g., medical records, must be physically stored within the country in which it is produced. These constraints make policy enforcement on a per-data-object basis an important requirement for AaaS systems.
Our work meets these challenges as follows. We designed PolarDBMS [2], a framework for a policy-driven and modular database management system that addresses the current requirements on Cloud DBMS. We are implementing PolarDBMS as a modular, OSGi-based system called UBstore, which enables researchers and DBMS developers to create distributed data management systems without making any assumptions about network layout, data schemas, etc. We are also working on a distributed, modular AaaS system called Tempo, which not only supports traditional temporal queries but also minimizes costs by allowing users and system providers to specify their requirements on a per-data and per-query basis. Moreover, it offers the full range of the aforementioned freshness queries.
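To illustrate how per-object policies and cost-aware replica selection might interact, the following sketch checks retention and location constraints and then picks the cheapest replica that is fresh enough. It is an assumption-laden toy, not the PolarDBMS, Tempo or ARCTIC implementation: `Replica`, `Policy` and all function names are hypothetical, cost is a single abstract number, and freshness is expressed as a maximum version age.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Replica:
    site: str          # hypothetical datacenter name
    country: str       # country in which the datacenter is located
    version_ts: float  # creation time of the version this replica holds
    cost: float        # abstract access cost (money, latency, load, ...)


@dataclass(frozen=True)
class Policy:
    # None means there is no restriction on the storage location.
    allowed_countries: Optional[FrozenSet[str]] = None
    min_retention: float = 0.0            # keep a version at least this long
    max_retention: float = float("inf")   # delete a version after this long


def placement_allowed(policy: Policy, country: str) -> bool:
    """Location constraint: may this object be stored in `country`?"""
    return policy.allowed_countries is None or country in policy.allowed_countries


def may_delete(policy: Policy, version_ts: float, now: float) -> bool:
    """Lower retention bound: deletion is allowed once the version is old enough."""
    return now - version_ts >= policy.min_retention


def must_delete(policy: Policy, version_ts: float, now: float) -> bool:
    """Upper retention bound: the version has exceeded its maximum lifetime."""
    return now - version_ts > policy.max_retention


def cheapest_replica(replicas: List[Replica], policy: Policy,
                     max_staleness: float, now: float) -> Optional[Replica]:
    """Lowest-cost replica that is policy-compliant and fresh enough, or None."""
    eligible = [
        r for r in replicas
        if placement_allowed(policy, r.country)
        and not must_delete(policy, r.version_ts, now)
        and now - r.version_ts <= max_staleness
    ]
    return min(eligible, key=lambda r: r.cost, default=None)
```

Relaxing `max_staleness` enlarges the set of eligible replicas, which is where the cost-saving potential of answering queries with older versions comes from.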
All types of queries are answered by making use of ARCTIC (A Replicated Cost-aware Temporal Index for the Cloud) [1], which enables Tempo to select the cheapest replica satisfying any given query.
[1] Brinkmann, Filip-Martin, and Heiko Schuldt. "Towards Archiving-as-a-Service: A Distributed Index for the Cost-effective Access to Replicated Multi-Version Data." Proceedings of the 19th International Database Engineering & Applications Symposium. ACM, 2015.
[2] Fetai, Ilir, Filip M. Brinkmann, and Heiko Schuldt. "PolarDBMS: Towards a cost-effective and policy-based data management in the cloud." Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on. IEEE, 2014.