The End of Slow Networks: It's Time for a Redesign [Vision]

Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian
Brown University, firstname lastname@brown.edu

ABSTRACT

The next generation of high-performance RDMA-capable networks requires a fundamental rethinking of the design of modern distributed in-memory DBMSs. These systems are commonly designed under the assumption that the network is the bottleneck and thus must be avoided as much as possible. This assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel, and the bandwidth increases even more with the most recent EDR standard. Moreover, with increasing advances in RDMA, transfer latencies improve similarly fast. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of fast networks and suggest a new architecture. Second, we discuss initial results of a prototype implementation of this architecture for OLTP and OLAP, and show remarkable performance improvements over existing designs.

1. INTRODUCTION

We argue that the current trend towards high-performance Remote Direct Memory Access (RDMA) capable networks, such as InfiniBand FDR/EDR, will require a complete redesign of modern distributed in-memory DBMSs, which are built on the assumption that the network is the main bottleneck [9]. Consequently, these systems aim to avoid communication between machines, using techniques such as locality-aware partitioning schemes [50, 46, 19, 63], semi-reductions for joins [52], and complicated preprocessing steps [48, 54]. Yet, with the nascent modern network technologies, the assumption that the network is the bottleneck no longer holds.

Even today, with InfiniBand FDR 4x [8], the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel. DDR3 memory bandwidth currently ranges from 6.25 GB/s (DDR3-800) to 16.6 GB/s (DDR3-2133) [1] per channel, whereas InfiniBand has a specified bandwidth of 1.7 GB/s (FDR 1x) to 37.5 GB/s (EDR 12x) [8] per NIC port (see Figure 1(a)). Moreover, future InfiniBand standards (HDR as well as NDR) promise a bandwidth that exceeds the bandwidth of the local memory bus by far.

However, modern systems typically support 4 memory channels per socket. For example, a machine with DDR3-1600 memory has 12.8 GB/s per channel, with a total aggregate memory bandwidth of 51.2 GB/s, and 4 dual-port FDR 4x NICs provide roughly the same bandwidth.¹ Even more surprisingly, the CPU-memory bandwidth is half-duplex, while InfiniBand and PCIe are full-duplex, such that only 2 NICs could saturate the memory bandwidth of a read/write workload. Figure 1(b) shows the theoretical (left) and measured (right) total memory and network throughput for a dual-socket machine with DDR3-1600 memory and two FDR 4x NICs per socket (4 in total).

¹ We do not assume that the PCIe bus becomes a bottleneck, as current dual-socket Xeon E5 boards typically have 40 Gen3 lanes per socket, achieving 39.4 GB/s total bandwidth.

Figure 1: Memory vs Network Bandwidth: (a) specification, (b) experiment for a dual-socket Xeon E5v2 server with DDR3-1600 and two FDR 4x NICs per socket.
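To see why four dual-port NICs land in the same ballpark as four memory channels, a rough back-of-the-envelope calculation (using the 12.8 GB/s per-channel figure above and the roughly 6.8 GB/s per FDR 4x port measured in Section 2.2) looks as follows:

$$ 4~\text{channels} \times 12.8~\text{GB/s} = 51.2~\text{GB/s} \quad \text{(memory, half-duplex)} $$
$$ 4~\text{NICs} \times 2~\text{ports} \times 6.8~\text{GB/s} \approx 54.4~\text{GB/s} \quad \text{(network, full-duplex, per direction)} $$

Because the network links are full-duplex, two such NICs already move roughly 2 x 2 x 6.8 GB/s ≈ 27 GB/s in each direction, or about 54 GB/s in total, which suffices to saturate the 51.2 GB/s of half-duplex memory bandwidth under a mixed read/write workload.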
This microbenchmark shows that the network transfer is indeed limited by the total available memory bandwidth, not the network bandwidth (see also Section 2 for more microbenchmarks). While these measurements were done for InfiniBand, we expect that Ethernet networks will become similarly advanced [58, 7, 25].

Another important factor is that with major advances in RDMA, the network latency also improves quickly. Our recent experiments with InfiniBand FDR 4x showed that the system requires ≈ 1µs to transfer 1KB of data using RDMA, compared to ≈ 0.08µs for the CPU to read the same amount of data from memory. With only 256KB, there is virtually no difference in the access times since the bandwidth starts to dominate the transfer time. Yet, we do not argue that the network latency will become as low as the memory latency. Instead, cache- and memory-locality will play an even more important role for small data requests (e.g., a hash-table look-up) as the system performance is no longer dominated by the network transfer time.

At the same time, particularly for smaller deployments, InfiniBand is becoming more affordable. For example, a small cluster with 8 servers, 2x Xeon E5v2 CPUs per machine, 2 TB of DDR3-1600 memory, and one 2-port InfiniBand FDR 4x NIC per machine costs under $80K, with roughly $20K for the switch and NICs. In this configuration, the bandwidth for sending data across the network is close to the bandwidth of one memory channel (13.6 GB/s for network vs. 12.8 GB/s for memory). Furthermore, memory prices continue to drop, making it feasible to keep even large data sets entirely in memory with just a few machines [3], removing the disk as a bottleneck and creating a more balanced system.

However, it is wrong to assume that the fast network turns the cluster into a NUMA architecture because: (1) the RDMA-based memory access patterns are very different from local memory access patterns in a NUMA architecture; (2) the latency
between machines to access a single (random) byte is still higher than in today's NUMA systems; and (3) hardware-embedded coherence mechanisms ensure data consistency in a NUMA architecture, which is not supported with RDMA. Clusters with RDMA-capable networks are most similar to a hybrid shared-memory and message-passing system: they are neither a shared-memory system (several address spaces exist) nor a pure message-passing system (data can be directly accessed via RDMA).

Consequently, we believe there is a need to critically rethink the entire distributed DBMS architecture to take full advantage of the next generation of network technology. For example, given the fast network, it is no longer obvious that avoiding distributed transactions is always beneficial. Similarly, distributed storage managers and distributed execution algorithms (e.g., joins) should no longer be designed to avoid communication at all costs [48], but instead should consider the multi-core architecture and caching effects more carefully, even in the distributed environment. While this is not the first attempt to leverage RDMA for databases [60, 55, 39], existing work does not fully recognize that next-generation networks create an architectural inflection point. This paper makes the following contributions:

• We present microbenchmarks to assess performance characteristics of one of the latest InfiniBand standards, FDR 4x (Section 2).
• We present alternative architectures for a distributed in-memory DBMS over fast networks and introduce a novel Network-Attached Memory (NAM) architecture (Section 3).
• We show why the common wisdom that "2-phase commit does not scale" no longer holds true for RDMA-enabled networks and outline how OLTP workloads can take advantage of the network by using the NAM architecture (Section 4).
• We analyze the performance of distributed OLAP operations (joins and aggregations) and propose new algorithms for the NAM architecture (Section 5).

2. BACKGROUND

Before making a detailed case for why distributed DBMS architectures need to fundamentally change to take advantage of the next generation of network technology, we provide some background information and micro-benchmarks that showcase the characteristics of InfiniBand and RDMA.

2.1 InfiniBand and RDMA

In the past, InfiniBand was a very expensive, high-bandwidth, low-latency network commonly found in large high-performance computing environments. However, InfiniBand has recently become cost-competitive with Ethernet and thus a viable alternative for enterprise clusters.

Communication Stacks: InfiniBand offers two network communication stacks: IP over InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA). IPoIB implements a classic TCP/IP stack over InfiniBand, allowing existing socket-based applications to run without modification. As with Ethernet-based networks, data is copied by the application into OS buffers, and the kernel processes the buffers by transmitting packets over the network. While providing an easy migration path from Ethernet to InfiniBand, our experiments show that IPoIB cannot fully leverage the network. On the other hand, RDMA provides a verbs API, which enables data transfers using the processing capabilities of an RDMA NIC (RNIC). With verbs, most of the processing is executed by the RNIC without OS involvement, which is essential for achieving low latencies. RDMA provides two verb communication models: one-sided and two-sided.
One-sided RDMA verbs (write, read, and atomic operations) are executed without involving the CPU of the remote machine. RDMA WRITE and READ operations allow a machine to write (read) data into (from) the remote memory of another machine. Atomic operations (fetch-and-add, compare-and-swap) allow remote memory to be modified atomically. Two-sided verbs (SEND and RECEIVE) enable applications to implement an RPC-based communication pattern that resembles the socket API. Unlike the first category, two-sided operations involve the CPU of the remote machine as well.

RDMA Details: RDMA connections are implemented using queue pairs (i.e., send/receive queues). The application creates the queue pairs on the client and the server, and the RNICs handle the state of the queue pairs. To communicate, a client creates a Work Queue Element (WQE) by specifying the verb and parameters (e.g., a remote memory location). The client puts the WQE into a send queue and informs the local RNIC via Programmed IO (PIO) to process the WQE. WQEs can be sent either signaled or unsignaled. Signaled means that the local RNIC pushes a completion event into a client's completion queue (CQ) via a DMA write once the WQE has been processed by the remote side. For one-sided verbs, the WQEs are handled by the remote RNIC via a DMA operation on the remote side (called the server) without interrupting the remote CPU. However, as a caveat when using one-sided operations, a memory region must be registered with the local and remote RNIC to be accessible by DMA operations (i.e., the RNIC stores the virtual-to-physical page mappings of the registered region). For two-sided verbs, the server does not need to register a memory region, but it must put a RECEIVE request into its receive queue to handle a SEND request from the client.

Since queue pairs process their WQEs in FIFO order, a typical pattern to reduce the overhead on the client side and to hide latency is to use selective signaling. That is, for send/receive queues of length n, the client can send n − 1 WQEs unsignaled and the n-th WQE signaled. Once the completion event (i.e., the acknowledgment message of the server) for the n-th WQE arrives, the client implicitly knows that the previous n − 1 WQEs have also been successfully processed. That way, computation and communication on the client can be efficiently overlapped without expensive synchronization mechanisms.

Another interesting aspect is how RDMA operations of an RNIC interfere with operations of the CPU if data is accessed concurrently: (1) Recent Intel CPUs (Sandy Bridge and later) provide a feature called Data Direct I/O (DDIO) [6]. With DDIO, the DMA executed by the RNIC to read (write) data from (to) remote memory places the data directly in the CPU L3 cache if the memory address is resident in the cache, in order to guarantee coherence. (2) On other systems, the cache is flushed/invalidated by the DMA operation to guarantee coherence. (3) Finally, non-coherent systems leave the coherency problem to the software. These effects must be considered when designing distributed RDMA-based algorithms. Also note that this only concerns coherence between the cache and memory, not the coherence between data once copied, which is always left to the software.
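To make the selective-signaling pattern above more concrete, the following is a minimal sketch in C against the libibverbs API. It assumes the queue pair was created with sq_sig_all = 0, that the memory region, remote address, and rkey have already been exchanged, and it omits all error handling; it is meant as an illustration of the pattern, not as the implementation used in the paper.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post n one-sided RDMA WRITEs on a queue pair; only the last WQE is
 * signaled. Its completion implies the previous n-1 WQEs are done too. */
static void post_writes_selective(struct ibv_qp *qp, struct ibv_cq *cq,
                                  struct ibv_mr *mr, char *local_buf,
                                  uint32_t msg_size, uint64_t remote_addr,
                                  uint32_t rkey, int n)
{
    for (int i = 0; i < n; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)(local_buf + (size_t)i * msg_size),
            .length = msg_size,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        /* selective signaling: only the n-th WQE generates a completion */
        wr.send_flags          = (i == n - 1) ? IBV_SEND_SIGNALED : 0;
        wr.wr.rdma.remote_addr = remote_addr + (uint64_t)i * msg_size;
        wr.wr.rdma.rkey        = rkey;
        ibv_post_send(qp, &wr, &bad_wr);
    }

    /* Busy-poll the completion queue for the single signaled completion. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
}

Note that the remote side is completely passive here; it only needs to have registered the target memory region beforehand.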
2.2 Micro-Benchmarks

This section presents microbenchmarks that compare the throughput and latency of: (1) a TCP/IP stack over 1Gbps Ethernet (IPoEth), (2) IPoIB, and (3) RDMA. These results inform the suggestions we make for the redesign of distributed DBMSs on InfiniBand.

Experimental Setup: In our micro-benchmarks we used two machines, each with an Intel Xeon E5-2660 v2 processor and 256GB RAM. Both machines were equipped with a Mellanox Connect-IB FDR 4x dual-port RNIC. Each port of the RNIC has a bandwidth of 54.54Gbps (6.8GB/s) and is full-duplex. Additionally, each machine had a 1Gbps Ethernet NIC (with one port) connected to the same Ethernet switch. Each machine ran Ubuntu Server 14.04 and used the OFED 2.3.1 driver for the RNIC. In our experiments, we used one port on the RNIC to better compare the InfiniBand results to the Ethernet results. In order to isolate low-level network properties, these microbenchmarks were executed in single-threaded mode.

Figure 2: Network Throughput and Latency: (a) Throughput, (b) Latency (message sizes 32B to 32MB for IPoEth, IPoIB, and RDMA)

Throughput and Latency (Figure 2): For this experiment, we varied the message size from 32B up to 32MB to simulate the characteristics of different workloads (OLTP and OLAP) and measured the throughput and latency for IPoEth, IPoIB, and RDMA send/receive and write/read. In addition, we also measured the RDMA atomic operations, but since they only support a maximal message size of 8B and show the same latency and throughput as 8B READs, we omitted the results from the figure.

While all RDMA verbs saturate the InfiniBand network bandwidth of approximately 6.8GB/s for message sizes greater than 2KB, IPoIB only achieves a maximum throughput of 3.5GB/s, despite using the same InfiniBand hardware as RDMA. Moreover, the latency of a message (i.e., 1/2 RTT) over IPoIB is also higher than for RDMA. In fact, for small message sizes, the latency of IPoIB is much closer to the latency of the 1Gbps Ethernet network (IPoEth). For example, for a message size of 8B, the latency is 20µs for IPoIB and 30µs for IPoEth, while an RDMA WRITE takes only 1µs. This is because the TCP/IP stack for IPoIB has a very high CPU overhead per message for small messages (as we will show later in Figure 3). For larger message sizes (≥ 1MB), the latency of IPoIB is closer to RDMA; however, it is still a factor of 2.5× higher than for RDMA. For example, a 1MB message has a latency of 393µs on IPoIB while it has only 161µs for RDMA. An interesting result is that an RDMA WRITE and a SEND take only 1µs for message sizes less than 256B, while an RDMA READ needs 2µs. This is because for WRITEs and SENDs, a payload of less than 256B can be inlined into the PIO, which avoids the subsequent DMA read [41].

Figure 3: CPU Overhead for Network Operations: (a) Client, (b) Server

CPU Overhead: We also measured the overhead (in CPU cycles) per message of different communication stacks on both the client and the server. Again, we vary the message sizes as in the previous experiment.
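As an illustration of how such a per-message CPU cost can be obtained, the sketch below reads the timestamp counter around a batch of send calls and divides by the number of messages; send_message is a hypothetical stand-in for the respective stack (a socket write for IPoEth/IPoIB or posting a verb for RDMA), not a function from the paper's benchmark code. Cycles spent asynchronously (e.g., in interrupt handlers on other cores) are not captured by this simple approach.

#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical per-stack send routine (socket write, ibv_post_send, ...). */
typedef void (*send_fn)(const void *buf, uint32_t len);

/* Average CPU cycles spent on the local core per message of size `len`. */
static double cycles_per_message(send_fn send_message, const void *buf,
                                 uint32_t len, int iterations)
{
    uint64_t start = __rdtsc();               /* read timestamp counter */
    for (int i = 0; i < iterations; i++)
        send_message(buf, len);
    uint64_t end = __rdtsc();
    return (double)(end - start) / iterations;
}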
Figure 3 shows that RDMA has a constant overhead on both the client and the server side that is independent of the message size. The reason is that the cost of registering a WQE on the RNIC is independent of the message size; the actual data transfer is executed by the RNIC, which acts as a co-processor to handle the given WQE. On the client side, the overhead is around 450 cycles independent of the RDMA verb used. The CPU overhead for atomic operations is the same. Moreover, as expected, on the server side only the RECEIVE verb causes CPU overhead. All one-sided verbs (READ/WRITE and the atomic operations) do not cause any overhead on the server side.

The overhead of IPoIB is very different from that of RDMA. In fact, it is much more similar to the overhead of the classical Ethernet-based TCP/IP stack (IPoEth). The major difference from RDMA is that for IPoEth and IPoIB, the per-message overhead grows linearly with the message size once the message size exceeds the TCP window size (which was the default value of 1488B for IPoEth and 21888B for IPoIB in our experiment). Even more interesting is that for small message sizes, the per-message overhead of IPoIB is even higher than for IPoEth. For example, an 8B message needs 7544 cycles for IPoEth and 13264 cycles for IPoIB.

3. RETHINKING THE ARCHITECTURE

In this section, we discuss why the traditional architecture for distributed in-memory DBMSs is not optimal for many real-world workloads and then present novel alternatives for fast RDMA-enabled networks. We then discuss research challenges that arise for these new architectures.

3.1 Architectures for Fast Networks

3.1.1 The Traditional Shared-Nothing Architecture

Figure 4(a) shows the classical shared-nothing (SN) architecture for distributed in-memory databases over slow networks. Here, the database state is partitioned over the main memory (RAM) of multiple nodes, where each node has direct access only to the database partition located in its local RAM. Furthermore, in order to implement distributed control-flow and data-flow, nodes communicate with each other using socket-based send/receive operations.

For efficient distributed query and transaction processing, the main goal is to maximize data-locality for a given workload by applying locality-aware partitioning schemes or by leveraging communication-avoiding strategies (e.g., semi-joins). Ideally, no communication happens between the nodes. For many real-world workloads, however, network communication cannot be entirely avoided, resulting in large performance penalties for slow networks. For example, even when resorting to the best techniques for co-partitioning the tables [20, 46], it is not always possible to avoid expensive distributed join operations or distributed
transactions, causing high communication costs [48]. Furthermore, workloads change over time, which makes it even harder to find a good static partitioning scheme [22], and dynamic strategies might require moving huge amounts of data, further restricting the bandwidth for the actual work. As a result, the network not only limits the throughput of the system, but also its scalability; the more machines are added, the more of a bottleneck the network becomes.

Figure 4: In-Memory Distributed Architectures: (a) SN (IPoEth), (b) SN (IPoIB), (c) SM (RDMA), (d) NAM (RDMA)

3.1.2 The Shared-Nothing Architecture for IPoIB

An easy way to migrate from the traditional shared-nothing architecture to fast networks, such as InfiniBand, is to simply use IPoIB, as shown in Figure 4(b). The advantage of this architecture is that it requires almost no change to the database system itself while still benefiting from the extra bandwidth. In particular, data-flow operations that send large messages (e.g., data re-partitioning) will benefit tremendously from this change. However, as shown in Section 2, IPoIB cannot fully leverage the network. Perhaps surprisingly, for some types of operations, upgrading the network with IPoIB can actually decrease the performance. This is particularly true for control-flow operations, which require sending many small messages. Figure 3 shows that the CPU overhead of IPoIB is above the overhead of IPoEth for small messages. In fact, as we will show in Section 4, these small differences can have a negative impact on the overall performance of distributed transaction processing.

3.1.3 The Distributed Shared-Memory Architecture

Obviously, to better leverage the network we have to take advantage of RDMA. RDMA not only allows the system to fully utilize the bandwidth (see Figure 2(a)), but also reduces network latency and CPU overhead (see Figures 2(b) and 3). Unfortunately, changing an application from a socket-based message-passing interface to RDMA verbs is not trivial. One possibility is to treat the cluster as a shared-memory system (shown in Figure 4(c)) with two types of communication patterns: message passing using RDMA-based SEND/RECEIVE verbs and remote direct memory access through one-sided RDMA READ/WRITE verbs. However, as stated before, there is no cache-coherence protocol. Moreover, machines need to carefully declare the sharable memory regions a priori and connect them via queue pairs. The latter, if not used carefully, can also have a negative effect on performance [30]. In addition, a memory access via RDMA is very different from that of a shared-memory system. While a local memory access only keeps one copy of the data around (i.e., conceptually it moves the data from main memory to the cache of a CPU), a remote memory access creates a fully-independent copy.
This has a range of implications, from garbage collection and cache/buffer management to consistency protocols. Thus, in order to achieve the appearance of a shared-memory system, the software stack has to hide the differences and provide a real distributed shared-memory space. There have been recent attempts to create a distributed shared-memory architecture over RDMA [21]. However, we believe that a single abstraction for local and remote memory is the wrong approach. Databases usually want to have full control over memory management, and just as virtual memory management can get in the way of a database system, we believe the same holds for shared memory over RDMA. While we had the ambition to validate this assumption experimentally, we found only one commercial offering, which targets IBM mainframes [5].

Instead, for our OLTP comparison, we implemented a simplified version of this architecture by essentially using an SN architecture and replacing socket communication with two-sided RDMA verbs (send and receive). We omit this architecture entirely for our OLAP comparison since two-sided RDMA verbs would have added additional synchronization overhead to our system (i.e., an RDMA RECEIVE must be issued strictly before the RDMA SEND arrives at the RNIC), which would have simply slowed down the execution of our OLAP algorithms when compared to their NAM alternatives.

3.1.4 The Network-Attached-Memory Architecture

Based on the previous considerations, we envision a new type of architecture, referred to as network-attached memory (or NAM for short), shown in Figure 4(d). In a NAM architecture, compute and storage are logically decoupled. The storage servers provide a shared distributed memory pool, which can be accessed from any compute node. However, the storage nodes are not aware of any database-specific operations (e.g., joins or consistency protocols); these are implemented by the compute nodes. This logical separation helps to control the complexity and makes the system aware of the different types of main memory. Moreover, the storage nodes can take care of issues like garbage collection, data reorganization, or metadata management to find the appropriate remote-memory address of a data page. Note that it is still possible to physically co-locate storage nodes and compute nodes on the same machine to further improve performance. However, in contrast to the previous architecture, the system gains more control over what data is copied and how copies are synchronized.

The NAM architecture also has several other advantages. Most importantly, storage nodes can be scaled independently of compute nodes. Furthermore, the NAM architecture can efficiently handle data imbalance since any node can access any remote partition without the need to re-distribute the data beforehand. It should be noted that this separation of compute and storage is not new. However, similar existing systems all use an extended key/value-like interface for the storage nodes [15, 36, 39] or are focused on the cloud [16, 2], instead of being built from scratch to leverage high-performance networks like InfiniBand. Instead, we argue that the storage servers in the NAM architecture should expose an
interface that supports fine-grained byte-level memory access and preserves some of the underlying hardware features. For example, in Section 4 we show how the fine-grained addressability allows us to efficiently implement concurrency control. In the future, we plan to take advantage of the fact that messages arrive in order between queue pairs.

3.2 Challenges and Opportunities

Unfortunately, moving from a shared-nothing or shared-memory system to a NAM architecture requires a redesign of the entire distributed database architecture, from storage management to query processing and transaction management, up to query compilation and metadata management.

Transactions & Query Processing: Distributed query processing is typically implemented using a data-parallel execution scheme that leverages re-partitioning operators which shuffle data over the network. However, re-partitioning operators typically do not pay much attention to efficiently leveraging the CPU caches of individual machines in the cluster. Thus, we believe that there is a need for parallel cache-aware algorithms for query operators over RDMA. Similarly, we require new query optimization techniques for distributed in-memory database systems with high-bandwidth networks. As previously mentioned, existing distributed database systems assume that the network is the dominant bottleneck. Therefore, existing cost models for distributed query optimization often consider the network cost as the only cost factor [44]. With fast networks, and thus a more balanced system, the optimizer needs to consider more factors since bottlenecks can shift from one component (e.g., CPU) to another (e.g., memory bandwidth) [18].

Additionally, we believe that a NAM architecture requires new load-balancing schemes that implement ideas suggested for work-stealing on single-node machines [35]. For example, query operations could access a central data structure (i.e., a work queue) via one-sided RDMA verbs, which contains pointers to small portions of data to be processed by a given query. When a node is idle, it could pull data from the work queue. That way, distributed load-balancing schemes can be efficiently implemented in a decentralized manner. Compared to existing distributed load-balancing schemes, this avoids single bottlenecks and would allow greater scalability while also avoiding stragglers.

Storage Management: Since the latency of one-sided RDMA verbs (i.e., read and write) to access remote database partitions is still much higher than for local memory accesses, we need to optimize the storage layer of a distributed database to minimize this latency. One idea in this direction is to develop complex storage access operations that combine different storage primitives in order to effectively minimize network round-trips between compute and storage nodes. This is in contrast to existing storage managers, which offer only simple read/write operations. For example, in Section 4 we present a complex storage operation for a distributed SI protocol that combines the locking and validation of the 2PC commit phase using a single RDMA atomic operation. However, for such complex operations, the memory layout must be carefully developed. Our current prototype therefore combines the lock information and the value into a single memory location (a sketch of such a combined layout is given below). Modern RNICs, such as the ConnectX-4 Pro, provide a programmable device (e.g., an FPGA) on the RNIC.
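To illustrate the combined lock/value layout mentioned above, the following is a minimal sketch; the 64-bit header format, the names, and the bit assignment are our own illustration of the idea, not necessarily the exact layout used in the prototype. The point is that a single RDMA compare-and-swap (e.g., a WQE with opcode IBV_WR_ATOMIC_CMP_AND_SWP) on this word can, in one round-trip, both verify that a record still carries the version the transaction read and acquire its lock.

#include <stdint.h>

/* Illustrative 64-bit record header stored on a storage node:
 * the most significant bit is a lock flag, the remaining 63 bits hold the
 * version (e.g., the commit timestamp of the last writer). */
#define LOCK_BIT (1ULL << 63)

static inline uint64_t make_header(uint64_t version, int locked)
{
    return (locked ? LOCK_BIT : 0) | (version & ~LOCK_BIT);
}

/* Operands for an RDMA compare-and-swap that locks a record only if it is
 * currently unlocked and still has the version the transaction read. */
static inline void lock_and_validate_operands(uint64_t read_version,
                                              uint64_t *expected,
                                              uint64_t *desired)
{
    *expected = make_header(read_version, 0); /* unlocked, unchanged version */
    *desired  = make_header(read_version, 1); /* same version, lock bit set  */
}

If the old value returned by the atomic operation differs from the expected value, the record has either been modified or is locked by another transaction, and the commit attempt can be aborted without any further round-trip.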
Given such programmable RNICs, another idea to reduce storage access latencies is to implement complex storage operations that cannot easily be mapped to existing RDMA verbs directly in hardware. For example, writing data directly into a remote hash table of a storage node could be implemented completely on the RNICs in a single round-trip without involving the CPUs of the storage nodes, hence allowing for new distributed join operations. Finally, we believe that novel techniques must be developed that allow efficient pre-fetching using RDMA. The idea is that the storage manager issues RDMA requests (e.g., RDMA READs) for memory regions that are likely to be accessed next, and the RNIC processes them asynchronously in the background. When a request for a remote memory address is about to be executed, the RDMA storage manager first polls the completion queue to check whether the remote memory has already been prefetched. While this is straightforward for sequential scanning of table partitions, index structures, which often rely on random access, require a more careful design.

Metadata Management and Query Compilation: Typically, a distributed DBMS architecture has one central node which is responsible for metadata management and query compilation. In a classical architecture, this central node can either fail or become a bottleneck under heavy loads. In a NAM architecture, where all nodes can access central data structures using remote memory accesses, any node can read and update the metadata. Therefore, any node can compile queries and coordinate their execution. Thus, query compilation and metadata management are neither a bottleneck nor a single point of failure.

4. THE CASE FOR OLTP

The traditional wisdom is that distributed transactions, particularly when using 2-phase commit (2PC), do not scale [59, 31, 57, 19, 45, 53]. In this section, we show that this is indeed the case on a shared-nothing architecture over slow networks and then present a novel protocol for the NAM architecture that can take full advantage of the network and, theoretically, removes the scalability limit.

4.1 Why 2PC does not scale

In this section, we discuss factors that hinder the scalability of distributed transaction processing over slow networks. Modern DBMSs employ Snapshot Isolation (SI) to implement concurrency control and isolation because it promises superior query performance compared to lock-based alternatives. The discussion in this section is based on a 2PC protocol for generalized SI [37, 23]. However, the findings can also be generalized to more traditional 2PC protocols [42].

4.1.1 Dissecting 2PC

Figure 5(a) shows a traditional (simplified) protocol using 2-phase commit with generalized SI guarantees [37, 23], assuming that the data is partitioned across the nodes (i.e., a shared-nothing architecture) and without considering the read phase (see also [15, 17, 49]). That is, we assume that the client (e.g., an application server) has already read all necessary records to issue the full transaction using a (potentially older) read timestamp (RID), which guarantees a consistent view of the data. After the client finishes reading the records, it sends the commit request to the transaction manager (TM) [one-way message 1]. While Figure 5(a) only shows one TM, there can be more, evenly distributing the load across nodes. As a next step, the TM requests a commit timestamp (CID) [round-trip message 2]. In this paper, we assume