Squall: Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases

Aaron J. Elmore (1), Vaibhav Arora (2), Rebecca Taft (3), Andrew Pavlo (4), Divyakant Agrawal (2,5), Amr El Abbadi (2)
(1) University of Chicago, (2) University of California, Santa Barbara, (3) MIT CSAIL, (4) Carnegie Mellon University, (5) Qatar Computing Research Institute
aelmore@cs.uchicago.edu, {vaibhavarora,agrawal,amr}@cs.ucsb.edu, rytaft@mit.edu, pavlo@cs.cmu.edu

ABSTRACT

For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMS) provide the necessary scale-out support beyond what is possible with single-node systems. These DBMSs are optimized for the short-lived transactions that are common in on-line transaction processing (OLTP) workloads. One way that they achieve this is to partition the database into disjoint subsets and use a single-threaded transaction manager per partition that executes transactions one-at-a-time in serial order. This minimizes the overhead of concurrency control mechanisms, but requires careful partitioning to limit distributed transactions that span multiple partitions. Previous methods used off-line analysis to determine how to partition data, but the dynamic nature of these applications means that they are prone to hotspots. In these situations, the DBMS needs to reconfigure how data is partitioned in real-time to maintain performance objectives. Bringing the system off-line to reorganize the database is unacceptable for on-line applications.

To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs. Squall supports fine-grained repartitioning of databases in the presence of distributed transactions, high throughput client workloads, and replicated data. An evaluation of our approach on a distributed DBMS shows that Squall can reconfigure a database with no downtime and minimal overhead on transaction latency.

1. INTRODUCTION

The need for scalable main-memory DBMSs is motivated by decreasing memory costs and an increasing demand for high-throughput transaction processing systems. Such systems eschew legacy disk-oriented concurrency control and recovery mechanisms of traditional DBMSs [37] and alleviate the contention on shared data structures [17, 26, 31, 40]. But some OLTP databases are larger than the amount of DRAM that is available on a single node. Furthermore, modern web/mobile applications require always-available data with the ability to support sudden shifts in access patterns. The inability to react to changes in usage or mitigate potential downtime can cause significant financial losses for service providers [7].
This argues for the use of a distributed DBMS architecture where the database is deployed in memory-only storage on a cluster of shared-nothing nodes. These scalable DBMSs, colloquially known as NewSQL [10], achieve high performance and scalability without sacrificing the benefits of strong transactional guarantees by spreading the databases across nodes into disjoint partitions. Recent examples of these systems include H-Store [24] (and its commercial version VoltDB [6]), MemSQL [2], and SQLFire [5].

Even if a database resides entirely in memory across multiple partitions, the DBMS is not immune to problems resulting from changes in workload demands or access patterns. For example, sudden increases in the popularity of a particular item in the database can negatively impact the performance of the overall DBMS. Modern distributed systems can, in theory, add and remove resources dynamically, but in practice it is difficult to scale databases in this manner [23]. Increasing system capacity involves either scaling up a node by upgrading hardware or scaling out by adding additional nodes to the system in order to distribute load. Either scenario can involve migrating data between nodes and bringing nodes off-line during maintenance windows [18].

Previous work has shown how to migrate a database from one node to another incrementally to avoid having to shut down the system [8, 15, 19, 35]. These approaches, however, are not suitable for partitioned main-memory DBMSs. In particular, they are predicated upon disk-based concurrency control and recovery mechanisms. This means that they rely on concurrent data access to migrate data, using heavy-weight two-phase locking and snapshot isolation methods to ensure correctness [19, 35]. More proactive migration techniques rely on physical logging from standard replication and recovery mechanisms [8, 15]. But this dependency on the DBMS's replication infrastructure places additional strain on partitions that may already be overloaded. Above all, these approaches are inappropriate for partitioned main-memory DBMSs that execute transactions serially [37], since the system does not support concurrent access or physical logging. Even if one adapted these approaches to move partitions between nodes, they would still not be able to split one partition into multiple partitions. For example, if there is a particular entity in a partition that is extremely popular (e.g., the Wu Tang Clan's Twitter account), then instead of migrating the entire partition to a new node it is better to move those individual tuples to their own partition [38].

A better (but more difficult) approach is to dynamically reconfigure the physical layout of the database while the system is live. This allows the system to continue to process transactions as it migrates
data and immediately relieves contention on hotspots. Some distributed NoSQL DBMSs, such as MongoDB [3], support splitting and migration of partitions to new nodes when the system needs to re-balance [20]. But accomplishing this is easier when the DBMS does not support atomic operations on multiple objects. Another approach is to pre-allocate multiple "virtual" partitions for each real partition at start-up and then migrate some of the virtual partitions to new nodes for re-balancing the load [34]. The downside of this is that the DBMS has no control of the contents of these virtual partitions, and thus there is no way to know whether the migration will result in the desired change in performance until after the virtual partitions have been migrated. To the best of our knowledge, no DBMS today supports the fine-grained, tuple-level load balancing that is needed for the system to be truly autonomous.

Figure 1: The H-Store architecture from [33].

Given the lack of solutions for this problem, we present the Squall migration system for partitioned OLTP DBMSs. Our key contribution in this paper is an efficient mechanism for performing fine-grained live reconfiguration by safely interleaving data migration with executing transactions. We also present several optimizations that reduce the migration costs for a variety of workloads. To evaluate our work, we implemented Squall in the H-Store [1] DBMS and measured the system's performance using two OLTP workloads. Our results demonstrate that Squall is able to reconfigure a DBMS with no downtime and a minimal decrease in throughput.

The rest of the paper is organized as follows. We start in Section 2 with an overview of the DBMS architecture targeted by our work and the reconfiguration scenarios that are important in this operating environment. We then present Squall in Section 3, discuss the details of data reconfiguration management in Section 4, and outline several optimizations in Section 5. Section 6 outlines how fault tolerance is enabled in Squall. We then provide a thorough evaluation of Squall in Section 7. Finally, we describe related work in Section 8, and conclude in Section 9.

2. BACKGROUND

We begin with an overview of the architecture of H-Store, an example of the type of distributed DBMS that we target in this work. We then show how these DBMSs are susceptible to load imbalances and how a naïve migration approach to reconfiguring a database is insufficient. Although we use H-Store in our analysis, our work is applicable to any partitioned, main memory OLTP DBMS.

2.1 H-Store Architecture

H-Store is a distributed, row-oriented DBMS that supports serializable execution of transactions over main memory partitions.
We define an H-Store instance as a cluster of two or more nodes deployed within the same administrative domain. A node is a single physical computer system that contains a transaction coordinator that manages one or more partitions.

H-Store is optimized for the efficient execution of workloads that contain transactions invoked as pre-defined stored procedures. Each stored procedure is comprised of (1) parameterized queries and (2) control code that contains application logic intermixed with invocations of those queries. Client applications initiate transactions by sending the procedure name and input parameters to any node in the cluster. The partition where the transaction's control code executes is known as its base partition [33]. The base partition ideally will have most (if not all) of the data the transaction needs [32]. Any other partition involved in the transaction that is not the base partition is referred to as a remote partition.

As shown in Fig. 1, each partition is assigned a single-threaded execution engine that is responsible for executing transactions and queries for that partition. A partition is protected by a single lock managed by its coordinator that is granted to transactions one-at-a-time based on the order of their arrival timestamp [9, 13, 41]. A transaction acquires a partition's lock if (a) the transaction has the lowest timestamp that is not greater than the one for the last transaction that was granted the lock and (b) it has been at least 5 ms since the transaction first entered the system [37]. This wait time ensures that distributed transactions that send their lock acquisition messages over the network to remote partitions are not starved. We assume that standard clock-skew algorithms are used to keep the nodes' CPU clocks synchronized.

Executing transactions serially at each partition has several advantages for OLTP workloads. In these applications, most transactions only access a single entity in the database. That means that these systems are much faster than a traditional DBMS if the database is partitioned in such a way that most transactions only access a single partition [32]. The downside of this approach, however, is that transactions that need to access data at two or more partitions are slow. If a transaction attempts to access data at a partition that it does not have the lock for, then the DBMS aborts that transaction (releasing all of the locks that it holds), reverts any changes, and then restarts it once the transaction re-acquires all of the locks that it needs. This removes the need for distributed deadlock detection, resulting in better throughput for short-lived transactions [22].

All data in H-Store is stored in main memory. To ensure that transactions' modifications are durable and persistent, each node writes asynchronous snapshots of the entire database to disk at fixed intervals [25, 37]. In between these snapshots, the DBMS writes out a record to a redo-only command log for each transaction that completes successfully [27]. In addition to snapshots and command logging, main memory databases often use replication to provide durability and high availability. Each partition is fully replicated by another secondary partition that is hosted on a different node [27].

2.2 Database Partitioning

A partition plan for a database in a distributed OLTP DBMS is comprised of (1) partitioned tables, (2) replicated tables, and (3) transaction routing parameters [32].
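To make these three components concrete, the sketch below shows one way a range-based partition plan and its routing lookup could be represented. It is illustrative only: the names (PartitionPlan, KeyRange, getBasePartition) are hypothetical and do not correspond to H-Store's actual API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of a range-based partition plan and of routing a
// transaction to its base partition. All names here are hypothetical.
public class PartitionPlan {
    // One contiguous key range [minInclusive, maxExclusive) owned by a partition.
    static final class KeyRange {
        final long minInclusive, maxExclusive;
        final int partitionId;
        KeyRange(long min, long max, int pid) {
            this.minInclusive = min; this.maxExclusive = max; this.partitionId = pid;
        }
    }

    // Partitioned tables: table name -> ranges over its partitioning attribute.
    private final Map<String, List<KeyRange>> partitionedTables = new HashMap<>();
    // Replicated tables are simply listed; a copy resides on every partition.
    private final List<String> replicatedTables = new ArrayList<>();

    void addRange(String table, long min, long max, int partitionId) {
        partitionedTables.computeIfAbsent(table, t -> new ArrayList<>())
                         .add(new KeyRange(min, max, partitionId));
    }

    void addReplicatedTable(String table) {
        replicatedTables.add(table);
    }

    // Routing parameter lookup: map the value of a transaction's routing
    // parameter (e.g., a warehouse id) to the partition that owns it.
    int getBasePartition(String table, long routingKey) {
        for (KeyRange r : partitionedTables.getOrDefault(table, List.of())) {
            if (routingKey >= r.minInclusive && routingKey < r.maxExclusive) {
                return r.partitionId;
            }
        }
        throw new IllegalArgumentException("No partition owns key " + routingKey);
    }

    public static void main(String[] args) {
        PartitionPlan plan = new PartitionPlan();
        plan.addRange("WAREHOUSE", 0, 3, 1);   // keys [0,3) -> partition 1
        plan.addRange("WAREHOUSE", 3, 5, 2);   // keys [3,5) -> partition 2
        plan.addRange("WAREHOUSE", 5, Long.MAX_VALUE, 3);
        plan.addReplicatedTable("ITEM");       // read-mostly table copied everywhere
        System.out.println(plan.getBasePartition("WAREHOUSE", 4));  // prints 2
    }
}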
A table can be horizontally partitioned into disjoint fragments whose boundaries are based on the values of one (or more) of the table's columns (i.e., partitioning attributes). Alternatively, the DBMS can replicate non-partitioned tables across all partitions. This table-level replication is useful for read-only or read-mostly tables that are accessed together with other tables but do not partition in accordance with other tables. A transaction's routing parameters identify the transaction's base partition from its input parameters.

Administrators deploy databases using a partition plan that minimizes the number of distributed transactions by collocating the records that are used together often in the same partition [14, 32]. A plan can be implemented in several ways, such as using hash, range, or round-robin partitioning [16]. For this paper, we
modified H-Store to use range partitioning. We discuss how to support alternative partitioning schemes in Appendix C.

  Partition 1 - WAREHOUSE (W_ID, City, Zip): (1, Miami, 33132), (2, Seattle, 98101);
                CUSTOMER (C_ID, Name, W_ID): (14, Ron, 1), (2, Wyatt, 2), (12, Jack, 1)
  Partition 2 - WAREHOUSE: (3, New York, 10467), (4, Chicago, 60414);
                CUSTOMER: (1, Mike, 3), (1004, Gabriel, 3), (3, Dean, 3)
  Partition 3 - WAREHOUSE: (5, Los Angeles, 90001), (7, San Diego, 92008);
                CUSTOMER: (21, Snake, 5), (7, R.J., 5), (4, Stephen, 7)
  Partition 4 - WAREHOUSE: (10, Austin, 78702);
                CUSTOMER: (9, Todd, 10)
Figure 2: Simple TPC-C data, showing WAREHOUSE and CUSTOMER partitioned by warehouse IDs.

Fig. 2 shows a simplified TPC-C database partitioned by the plan in Fig. 5a. The WAREHOUSE table is partitioned by its id column (W_ID). Since there is a foreign key relationship between the WAREHOUSE and CUSTOMER tables, the CUSTOMER table is also partitioned by its W_ID attribute. Hence, all data related to a given W_ID (i.e., both WAREHOUSE and CUSTOMER) are collocated on a single partition. Any stored procedure that reads or modifies either table will use W_ID as its routing parameter. Since there is a foreign-key relationship between CUSTOMER and WAREHOUSE, the CUSTOMER table does not need an explicit mapping in the partition plan. We will use this simplified example throughout the paper for exposition.

2.3 The Need for Live Reconfiguration

Although partitioned DBMSs like H-Store execute single-partition transactions more efficiently than systems that use a heavyweight concurrency control scheme, they are still susceptible to performance degradations due to changes in workload access patterns [32]. Such changes could either cause a larger percentage of transactions to access multiple partitions or partitions to grow larger than the amount of memory available on their node. As with any distributed system, DBMSs need to react to these situations to avoid becoming overloaded; failing to do so in a timely manner can impact both performance and availability in distributed DBMSs [20].

To demonstrate the detrimental effect of overloaded partitions on performance, we ran a micro-benchmark using H-Store with a three-node cluster. For this experiment, we used the TPC-C benchmark with a 100 warehouse database evenly distributed across 18 partitions. We modified the TPC-C workload generator to create a hotspot on one of the partitions by having a certain percentage of transactions access one of three hot warehouses instead of a uniformly random warehouse. Transaction requests are submitted from up to 150 clients running on a separate node in the same cluster. We postpone the details of these workloads and the execution environment until Section 7.

Figure 3: As workload skew increases and new order transactions increasingly access 3 warehouses in TPC-C, the collocated warehouses experience reduced throughput due to contention. (x-axis: Percent of New Orders for Warehouses 1-3; y-axis: TPS)

As shown in Fig. 3, as the warehouse selection moves from a uniform to a highly skewed distribution, the throughput of the system degrades by ∼60%. This shows that overloaded partitions have a significant impact on the throughput of a distributed DBMS like H-Store. The solution is for the DBMS to respond to these adverse conditions by migrating data to either re-balance existing partitions or to offload data to new partitions.
Some of the authors designed E-Store [38] for automatically identifying when a reconfiguration is needed and for creating a new partition plan to shuffle data items between partitions. E-Store uses system-level statistics (e.g., sustained high CPU usage) to identify the need for reconfiguration, and then uses tuple-level statistics (e.g., tuple access frequency) to determine the placement of data to balance load across partitions. E-Store relies on Squall to execute the reconfiguration. Both components view each other as a black-box; E-Store only provides Squall with an updated partition plan and a designated leader node for a reconfiguration. Squall makes no assumptions about the plans generated by a system controller other than that all tuples must be accounted for.

Figure 4: A Zephyr-like migration on two TPC-C warehouses to alleviate a hot-spot effectively causes downtime in a partitioned main-memory DBMS. (x-axis: Elapsed Time (seconds); y-axis: TPS)

In building E-Store we evaluated a Zephyr-like migration for load-balancing (cf. Section 7 for a detailed description) where destination partitions reactively migrate tuples as needed and periodically pull large blocks of tuples. As shown by Fig. 4, however, this approach results in downtime for the system, and is therefore not an option for modern OLTP systems. This disruption is due to migration requests blocking transaction execution and a lack of coordination between partitions involved in the migration. These issues highlight the demand for a live reconfiguration system that seeks to minimize performance impact and eliminate downtime.

We define a live reconfiguration as a change in the assignment of data to partitions in which data is migrated without any part of the system going off-line. A reconfiguration can cause the number of partitions in the cluster to increase (i.e., data from existing partitions are sent to a new, empty partition), decrease (i.e., data from a partition being removed is sent to other existing partitions), or stay the same (i.e., data from one partition is sent to another partition).

3. OVERVIEW OF SQUALL

Squall is a technique for efficiently migrating fine-grained data in a strongly consistent distributed OLTP DBMS. The key advantage of Squall over previous approaches is that it does not require the DBMS to halt execution while all data migrates between partitions, thereby minimizing the impact of reconfiguration. The control of data movement during a reconfiguration is completely decentralized and fault-tolerant.

Squall is focused on the problem of how to perform this reconfiguration safely. In particular, Squall ensures that during the reconfiguration process the DBMS has no false negatives (i.e., the system assumes that a tuple does not exist at a partition when it actually does) or false positives (i.e., the system assumes that a tuple exists
at a partition when it does not). Determining when a reconfiguration should occur and how the partition plan should evolve are addressed in the E-Store project [38]. We assume that a separate system controller monitors the DBMS and then initiates the reconfiguration process by providing the system with the new partition plan when appropriate [18, 38].

  (a) Old Plan: plan: { "warehouse (W_ID)": { "Partition 1": [0-3), "Partition 2": [3-5), "Partition 3": [5-9), "Partition 4": [9- ) } }
  (b) New Plan: plan: { "warehouse (W_ID)": { "Partition 1": [0-2), "Partition 2": [3-5), "Partition 3": [2-3), [5-6), "Partition 4": [6- ) } }
Figure 5: An example of an updated partition plan for the TPC-C database shown in Fig. 2.

Squall processes a live reconfiguration in three stages: (1) initializing the partitions' tracking data structures, (2) migrating data between partitions, and (3) identifying when the reconfiguration process has terminated. In this section, we provide an overview of these three steps. We discuss Squall's data migration protocol in further detail in Section 4. We then present optimizations of this process in Section 5, such as eager data requests and a divide-and-conquer technique for reconfiguration plans.

3.1 Initialization

The initialization phase synchronously engages all of the partitions in the cluster to start a new reconfiguration. As part of this step, each partition prepares for reconfiguration and identifies the tuples it will be exchanging with other partitions.

A reconfiguration starts when the DBMS is notified by an external controller that the system needs to re-balance. This notification identifies (1) the new partition plan for the database and (2) the designated leader node for the operation. The leader is any node in the cluster that contains a partition affected by the reconfiguration. If the reconfiguration calls for a new node to be added to the cluster, then that node must be on-line before the reconfiguration can begin.

To initiate the reconfiguration, the leader invokes a special transaction that locks every partition in the cluster and checks to see whether it is allowed to start the reconfiguration process. This locking is the same as when a normal transaction needs to access every partition in the system. The request is allowed to proceed if (1) the system has terminated all previous reconfigurations and (2) the DBMS is not writing out a recovery snapshot of the database to disk. If either of these conditions is not satisfied, then the transaction aborts and is re-queued after the blocking operation finishes. This ensures that all partitions have a consistent view of data ownership and prevents deadlocks caused by concurrent reconfigurations.

Once all of the partitions agree to start the reconfiguration, they then enter a "reconfiguration mode" where each partition examines the new plan to identify which tuples are leaving the partition (outgoing) and which tuples will be moving into the partition (incoming). These incoming and outgoing tuples are broken into ranges based on their partitioning attributes. As we discuss in Section 5, this step is necessary because Squall may need to split tuple ranges into smaller chunks or split the reconfiguration into smaller sub-reconfigurations for performance reasons. After a partition completes this local data analysis, it notifies the leader and waits to learn whether the reconfiguration will proceed.
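As a rough sketch of the admission check described earlier in this subsection, the following illustrates how the leader's initialization transaction might decide whether a new reconfiguration can begin. The class and method names (ReconfigurationCoordinator, canStartReconfiguration) are hypothetical, and the actual locking and messaging are omitted.

// A sketch of the check performed by the leader's initialization transaction,
// assuming hypothetical state flags. In the real system this check runs while
// the special transaction holds the lock on every partition.
public class ReconfigurationCoordinator {
    private boolean previousReconfigurationActive;  // a prior reconfiguration has not terminated
    private boolean snapshotInProgress;             // a recovery snapshot is being written to disk

    // Returns true if the reconfiguration may proceed; otherwise the caller
    // aborts the initialization transaction and re-queues it once the
    // blocking operation (reconfiguration or snapshot) finishes.
    public synchronized boolean canStartReconfiguration() {
        if (previousReconfigurationActive) {
            return false;   // condition (1): all previous reconfigurations must have terminated
        }
        if (snapshotInProgress) {
            return false;   // condition (2): no recovery snapshot may be in progress
        }
        previousReconfigurationActive = true;  // enter "reconfiguration mode"
        return true;
    }

    public synchronized void reconfigurationTerminated() {
        previousReconfigurationActive = false;
    }

    public synchronized void setSnapshotInProgress(boolean inProgress) {
        this.snapshotInProgress = inProgress;
    }

    public static void main(String[] args) {
        ReconfigurationCoordinator c = new ReconfigurationCoordinator();
        System.out.println(c.canStartReconfiguration());  // true: no snapshot, no prior reconfiguration
        System.out.println(c.canStartReconfiguration());  // false until reconfigurationTerminated() is called
    }
}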
If all of the partitions agree to proceed with the reconfiguration, then the leader sends out an acknowledgement to all of the partitions to begin migrating data. Squall only uses this global lock during the initialization phase to synchronize all partitions to begin reconfiguration and does not migrate any data. Since the reconfiguration transaction only modifies the meta-data related to reconfiguration, the transaction is extremely short and has a negligible impact on performance. For all trials in our experimental evaluation, the average length of this initialization phase was ∼130 ms.

3.2 Data Migration

The easiest way to safely transfer data in a distributed DBMS is to stop executing transactions and then move data to its new location. This approach, known as stop-and-copy, ensures that transactions execute either before or after the transfer and therefore have a consistent view of the database. But shutting down the system is unacceptable for applications that cannot tolerate downtime, so stop-and-copy is not an option.

It is non-trivial, however, to transfer data while the system is still executing transactions. For example, if half of the data that a transaction needs has already been migrated to a different partition, it is not obvious whether it is better to propagate changes to that partition or restart the transaction at the new location. The challenge is in how to coordinate this data movement between partitions without any lost or duplicated data, and with minimal impact to the DBMS's performance.

To overcome these problems, Squall tracks the location of migrating tuples at each partition during the reconfiguration process. This allows each node's transaction manager to determine whether it has all of the tuples that are needed for a particular transaction. If the system is uncertain of the current location of required tuples, then the transaction is scheduled at the partitions where the data is supposed to be according to the new plan. Then when the transaction attempts to access the tuples that have not been moved yet, Squall will reactively pull data to the new location [19, 35]. Although this on-demand pull method introduces latency for transactions, it has four benefits: (1) it always advances the progress of data migration, (2) active data is migrated earlier in the reconfiguration, (3) it requires no external coordination, and (4) it does not incur any downtime to synchronize ownership metadata.

In addition to the on-demand data pulls, Squall also asynchronously migrates additional data so that the reconfiguration completes in a timely manner. All of these migration requests are executed by a partition in the same manner as regular transactions.

Each partition is responsible for tracking the progress of migrating data between itself and other partitions. In other words, each partition only tracks the status of the tuples it is migrating. This allows it to identify whether a particular tuple is currently stored in the local partition or whether it must retrieve it from another partition.

3.3 Termination

Since there is no centralized controller in Squall that monitors the process of data migration, it is up to each partition to independently determine when it has sent and/or received all of the data that it needs. Once a partition recognizes that it has received all of the tuples required for the new partition plan, it notifies the current leader that the data migration is finished at that partition.
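To make the per-partition termination check concrete, the sketch below shows one way a partition could decide that its part of the migration is finished. The names here (ReconfigurationTracker, RangeStatus, markComplete) are hypothetical; the statuses mirror the per-range states that Section 4.2 describes.

import java.util.HashMap;
import java.util.Map;

// A sketch of a partition's local termination check, assuming a hypothetical
// per-range status map. A partition reports completion to the leader only when
// every incoming and outgoing range it is responsible for has fully migrated.
public class ReconfigurationTracker {
    enum RangeStatus { NOT_STARTED, PARTIAL, COMPLETE }

    // Keyed by a string description of the range, e.g. "WAREHOUSE [6,9) 3->4".
    private final Map<String, RangeStatus> incomingRanges = new HashMap<>();
    private final Map<String, RangeStatus> outgoingRanges = new HashMap<>();
    private boolean leaderNotified = false;

    void trackIncoming(String range) { incomingRanges.put(range, RangeStatus.NOT_STARTED); }
    void trackOutgoing(String range) { outgoingRanges.put(range, RangeStatus.NOT_STARTED); }

    void markComplete(String range) {
        incomingRanges.replace(range, RangeStatus.COMPLETE);
        outgoingRanges.replace(range, RangeStatus.COMPLETE);
        maybeNotifyLeader();
    }

    // Each partition decides independently: once all of its ranges are COMPLETE,
    // it tells the current leader that data migration is finished locally.
    private void maybeNotifyLeader() {
        boolean done = incomingRanges.values().stream().allMatch(s -> s == RangeStatus.COMPLETE)
                    && outgoingRanges.values().stream().allMatch(s -> s == RangeStatus.COMPLETE);
        if (done && !leaderNotified) {
            leaderNotified = true;
            System.out.println("Notify leader: data migration finished at this partition");
        }
    }

    public static void main(String[] args) {
        ReconfigurationTracker t = new ReconfigurationTracker();
        t.trackOutgoing("WAREHOUSE [6,9) 3->4");
        t.trackIncoming("WAREHOUSE [2,3) 1->3");
        t.markComplete("WAREHOUSE [6,9) 3->4");   // nothing printed yet
        t.markComplete("WAREHOUSE [2,3) 1->3");   // now the leader is notified
    }
}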
When the leader receives acknowledgments from all of the partitions in the cluster, it notifies all partitions that the reconfiguration process is complete. Each partition removes all of its tracking data structures and exits the reconfiguration mode.

4. MANAGING DATA MIGRATION

The migration of data between partitions in a transactionally safe manner is the most important feature of Squall. As such, we now discuss this facet of the system in greater detail. We first describe how Squall divides each partition's migrating tuples into ranges and tracks the progress of the reconfiguration
based on these ranges. We then describe the two ways that Squall moves data: reactive migration and asynchronous migration. The former moves tuples when transactions need them. This allows the system to migrate hot tuples early in the reconfiguration without the use of complex usage modeling [36]. The latter is when the system periodically sends data to ensure that the reconfiguration eventually completes with minimal impact on the DBMS's performance.

To help explain this process, we will refer to Figs. 5 and 6 as our running example. For a particular tuple that is migrating from one partition to another, we refer to the partition that is losing that tuple as the source partition and the partition that receives the tuple as the destination partition. Although a partition can be both a source and destination during a reconfiguration, we refer to a partition as either a source or destination for a particular tuple.

Figure 6: As a system's partition plan changes, Squall tracks the progress of the reconfiguration at each node to ensure correct data ownership.

4.1 Identifying Migrating Data

When a new reconfiguration begins, Squall calculates the difference between the original partition plan and the new plan to determine the set of incoming and outgoing tuples per partition. This allows a partition to determine when it can safely execute a transaction that accesses migrating data without any external metadata. Migrating tuples are organized into reconfiguration ranges that specify a table name, the partitioning attribute(s) of the table, the minimum-inclusive key, maximum-exclusive key, and the old/new partition IDs. Tables in distributed OLTP DBMSs are partitioned by one or more columns [14, 32]. Without loss of generality, however, we discuss reconfiguration ranges for the single column case.

Each partition locally organizes the ranges into incoming and outgoing ranges based on the previous and new partition IDs. For example, in the reconfiguration shown in Figs. 5 and 6, the incoming warehouses for partition 3 are noted as:

(WAREHOUSE, W_ID = [2, 3), 1 → 3)

This means that partition 3 receives warehouse 2 from partition 1. Likewise, the following range identifies that partition 3 sends all warehouses with an ID of 6 or greater to partition 4:

(WAREHOUSE, W_ID = [6, ∞), 3 → 4)

These rules cascade for all tables that are not explicitly listed in these partition plan entries. That is, any table with a foreign-key relationship to one of the tables identified in an entry will have its tuples migrated based on these rules as well. In our TPC-C example, the CUSTOMER table is partitioned by its WAREHOUSE ID, thus partition 3 would also have the following implicit rule that sends all customers with a W_ID of 6 or greater to partition 4:

(CUSTOMER, W_ID = [6, ∞), 3 → 4)

All of the tables in TPC-C that are partitioned on their WAREHOUSE id, such as DISTRICT, ORDERS, and STOCK, are handled similarly.
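As a rough sketch of how such ranges could be derived, the following compares the old and new plans to compute one partition's outgoing reconfiguration ranges. The names (RangeDiff, ReconfigurationRange, deriveOutgoingRanges) are hypothetical, it assumes a single integer partitioning column, and it omits the cascading to co-partitioned tables such as CUSTOMER, which would reuse the same ranges.

import java.util.ArrayList;
import java.util.List;

// A sketch of deriving outgoing reconfiguration ranges for one partition by
// intersecting the old and new plans. The real calculation must also handle
// further splitting of ranges and the foreign-key cascading described above.
public class RangeDiff {
    // (table, [minInclusive, maxExclusive), oldPartition -> newPartition)
    record ReconfigurationRange(String table, long min, long max, int oldPart, int newPart) {}

    // One plan entry: partition `part` owns keys in [min, max) of `table`.
    record PlanEntry(String table, long min, long max, int part) {}

    static List<ReconfigurationRange> deriveOutgoingRanges(
            int partitionId, List<PlanEntry> oldPlan, List<PlanEntry> newPlan) {
        List<ReconfigurationRange> outgoing = new ArrayList<>();
        for (PlanEntry oldE : oldPlan) {
            if (oldE.part() != partitionId) continue;        // only ranges we currently own
            for (PlanEntry newE : newPlan) {
                if (!newE.table().equals(oldE.table()) || newE.part() == partitionId) continue;
                long lo = Math.max(oldE.min(), newE.min());   // overlap of old and new ownership
                long hi = Math.min(oldE.max(), newE.max());
                if (lo < hi) {
                    outgoing.add(new ReconfigurationRange(oldE.table(), lo, hi,
                                                          partitionId, newE.part()));
                }
            }
        }
        return outgoing;
    }

    public static void main(String[] args) {
        // WAREHOUSE ranges from Fig. 5: partition 3 owned [5,9) and now owns [2,3) and [5,6).
        List<PlanEntry> oldPlan = List.of(
            new PlanEntry("WAREHOUSE", 0, 3, 1), new PlanEntry("WAREHOUSE", 3, 5, 2),
            new PlanEntry("WAREHOUSE", 5, 9, 3), new PlanEntry("WAREHOUSE", 9, Long.MAX_VALUE, 4));
        List<PlanEntry> newPlan = List.of(
            new PlanEntry("WAREHOUSE", 0, 2, 1), new PlanEntry("WAREHOUSE", 3, 5, 2),
            new PlanEntry("WAREHOUSE", 2, 3, 3), new PlanEntry("WAREHOUSE", 5, 6, 3),
            new PlanEntry("WAREHOUSE", 6, Long.MAX_VALUE, 4));
        // Prints partition 3's outgoing range: WAREHOUSE [6,9) moves from partition 3 to 4,
        // equivalent to the [6, ∞) rule above, since partition 3 only owned W_IDs in [5, 9).
        System.out.println(deriveOutgoingRanges(3, oldPlan, newPlan));
    }
}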
We note that this makes predicting how long the migration will take for a partition difficult. Since these tables are not partitioned by a primary or unique key, the number of tuples associated with a range can be far larger than the cardinality of the partition keys encapsulated by the range (e.g., there can be thousands of customers associated with a single W_ID).

Each of the above ranges is derived deterministically from the original and new partition plans. This means that each partition can independently calculate its local set of incoming and outgoing ranges from the updated plan that it received in the initialization phase. The advantage of this approach is that a partition only has to track the list of ranges migrating to or from itself. This reduces the amount of global state that Squall maintains during the reconfiguration and facilitates several other enhancements to improve performance. We discuss these additional optimizations for this phase in Section 5, including how to split ranges to reduce the amount of data associated with each of them.

4.2 Tracking Reconfiguration Progress

Squall tracks the progress of data migration for all incoming and outgoing ranges at a particular partition. It maintains a table at each partition to record the current status of these ranges:

NOT STARTED: The data associated with the range has not yet migrated to/away from this partition, and therefore all data associated with the range is located at the source partition.

PARTIAL: Some of the data for the range has been migrated, and some of the tuples may be currently in-flight between the source and destination partitions.

COMPLETE: All of the data for the range has migrated to the destination partition.

Continuing with our example from Section 4.1, a status of NOT STARTED for the CUSTOMER table range indicates that all customers with a W_ID of 6 or greater are present only at partition 3. This means that any transaction that needs to access the tuples for these customers will do so at partition 3.

Since a transaction could execute a query that contains a predicate at a different granularity than the reconfiguration range, Squall allows for the initial reconfiguration ranges to be split into sub-ranges by a query. For the same WAREHOUSE range example with a NOT STARTED status, assume that a transaction arrives at partition 4 that executes the following query:

SELECT * FROM WAREHOUSE WHERE W_ID >= 6 AND W_ID <= 7

The original range includes warehouses with W_ID > 7, thus the entire range is not needed for this transaction. In this scenario, partition 4 would split the original range [6, ∞) into the following two ranges, both with the status of NOT STARTED:

(WAREHOUSE, W_ID = [6, 8), 3 → 4)
(WAREHOUSE, W_ID = [8, ∞), 3 → 4)

When partition 4 requests the data associated with the first range, partition 3 similarly splits its original range to reflect the requested range. Once the data for the sub-range is migrated, both partitions update the status of the first range to COMPLETE.

For transactions with queries that access an individual key through an equality predicate (e.g., W_ID = 7), Squall must find the range that the key belongs to in its tracking table to determine which partition has that data. Since many OLTP workloads are comprised of transactions that access tuples through single keys, Squall also supports recording the movement of individual tuples at the key level through its tracking table. This enables faster lookups at runtime than scanning the partition plan entries to determine whether