Self-Optimizing Cloud Data Partitioning (Master's Thesis, Completed)
Data partitioning is a well-known approach to achieving horizontal scalability. Its main goal is to eliminate, or at least minimize, distributed transactions by collocating data that is frequently accessed together on the same node. The most common approach is manual data partitioning. From the programmer's point of view, however, manual partitioning requires the analysis of complex data access patterns. Moreover, if the access patterns change, re-partitioning is required, which again can be very time-consuming. Newer approaches such as [1,2,3,4] delegate the partitioning task to the underlying database management system (DBMS) and thus relieve programmers of this complex and time-consuming task. The existing approaches differ in the algorithms they use and in their adaptability: some follow a static approach, whereas others can adapt at runtime. The goal of our project is along the lines of [1,2,3,4] in the sense that we also require the DBMS to take over the partitioning. However, we argue that existing approaches lack several important properties. A partitioning approach should:

- Create partitions at runtime based on the analysis of access patterns, without any manual intervention.
- Cope with different initial system configurations, i.e., optimally partition the data starting from either a fully replicated or a non-optimally partitioned system.
- Be adaptable: if the access patterns change, the partitions should be modified to cope with the changes, and the adaptation time should be minimal.
- Replicate partitions for high availability, based on client requirements.
- Optimally place partitions and their replicas so that access latency to the data is minimized. The placement should take different criteria into account, e.g., network distance to the clients and the characteristics of the machine hosting the partition.
- Be self-healing: if replicas fail, it should create new ones in order to maintain the desired availability.
- Be suitable for the cloud, which implies taking the monetary cost into account.

The goal of this Master's Thesis is to analyze existing partitioning approaches, their advantages and disadvantages, and to develop a novel partitioning scheme that fulfills the requirements specified above.

Mandatory Tasks

- Identify further partitioning approaches in addition to [1,2,3,4].
- Define a criteria catalogue for comparing the identified approaches. The catalogue is to be derived from the analysis of the existing approaches and the list of desired properties above.
- Analyze existing data partitioning approaches (at least [1,2,3,4]) according to the criteria catalogue.
- Propose a new partitioning approach. The new approach should provide at least the following properties:
  - Dynamic partitioning at runtime.
  - Adaptivity: if access patterns change, it should adapt the partitions on the fly. It should nevertheless guarantee a stable system state even if access patterns change very rapidly, i.e., it must avoid oscillations.
  - Replication for load balancing and fault tolerance.
- Implement the proposed partitioning approach.
- Evaluate the approach using existing benchmarks such as TPC-C [5] or YCSB [6].

Optional Tasks

- Self-healing: if replicas fail, create new ones in order to maintain the desired availability.
- Optimal partition placement based on network distance, monetary cost, load, etc.
- Provide the possibility of user feedback on the quality of the partitioning.

References

[1] Sudipto Das, Divyakant Agrawal, and Amr El Abbadi: ElasTraS: An Elastic Transactional Data Store in the Cloud. In Proceedings of HotCloud, 2009.
[2] Carlo Curino, Evan P. C. Jones, Raluca A. Popa, Nirmesh Malviya, Eugene Wu, Samuel Madden, Hari Balakrishnan, and Nickolai Zeldovich: Relational Cloud: A Database Service for the Cloud. In Proceedings of CIDR, 2011.
[3] P. A. Bernstein et al.: Adapting Microsoft SQL Server for Cloud Computing. In Proceedings of ICDE, 2011.
[4] C. Curino et al.: Schism: A Workload-Driven Approach to Database Replication and Partitioning. In Proceedings of VLDB, 2010.
[5] TPC-C: http://www.tpc.org/tpcc/
[6] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears: Benchmarking Cloud Serving Systems with YCSB. In Proceedings of SoCC, 2010.
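To make the workload-driven partitioning idea discussed above more concrete, the following Python sketch builds a co-access graph from a transaction trace and places records greedily. Note that this is our own minimal illustration, not the method of any cited system: the function names and the greedy heuristic with a balance cap are assumptions for exposition; Schism [4], for instance, uses a real min-cut graph partitioner (METIS) over per-tuple nodes instead.

```python
import math
from collections import defaultdict
from itertools import combinations

def co_access_graph(transactions):
    """Build an undirected co-access graph: nodes are record ids and the
    weight of edge (a, b) counts how many transactions touch both a and b."""
    weights = defaultdict(int)
    for txn in transactions:
        for a, b in combinations(sorted(set(txn)), 2):
            weights[(a, b)] += 1
    return weights

def greedy_partition(records, weights, k):
    """Place each record on the partition with which it shares the most
    co-access weight, under a simple balance cap of ceil(n/k) records per
    partition. A toy stand-in for a proper min-cut graph partitioner."""
    cap = math.ceil(len(records) / k)
    parts = [set() for _ in range(k)]
    for r in records:
        def affinity(part):
            # Total co-access weight between r and the records already in part.
            return sum(weights.get(tuple(sorted((r, q))), 0) for q in part)
        open_parts = [j for j in range(k) if len(parts[j]) < cap]
        best = max(open_parts, key=lambda j: (affinity(parts[j]), -len(parts[j])))
        parts[best].add(r)
    return parts

# Example workload: each transaction lists the record ids it touches.
txns = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["a", "c"]]
w = co_access_graph(txns)
parts = greedy_partition(["a", "b", "c", "d"], w, k=2)
# Records a/b and c/d end up collocated; only the rare a-c access
# crosses partition boundaries (i.e., would become distributed).
```

A production scheme would additionally have to re-run (or incrementally update) such a partitioning as the workload drifts, and damp the resulting migrations to avoid the oscillations mentioned in the task list.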