An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

Isaac Gelado, Javier Cabezas, Nacho Navarro
Universitat Politecnica de Catalunya
{igelado, jcabezas, nacho}@ac.upc.edu

John E. Stone, Sanjay Patel, Wen-mei W. Hwu
University of Illinois
{jestone, sjp, hwu}@illinois.edu

Abstract

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.

This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs.

We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.

Categories and Subject Descriptors D.4.2 [Operating Systems]: Storage Management—Distributed Memories

General Terms Design, Experimentation, Performance

Keywords Heterogeneous Systems, Data-centric Programming Models, Asymmetric Distributed Shared Memory

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ASPLOS'10, March 13-17, 2010, Pittsburgh, Pennsylvania, USA.
Copyright © 2010 ACM 978-1-60558-839-1/10/03...$10.00

1. Introduction

Maximizing multi-thread throughput and minimizing single-thread latency are two design goals that impose very different and often conflicting requirements on processor design. For example, the Intel Xeon E7450 [28] processor consists of six processor cores, each of which is a high-frequency, out-of-order, multi-instruction-issue processor with a sophisticated branch prediction mechanism to achieve short single-thread execution latency. This is in contrast to the NVIDIA Tesla GT200 Graphics Processing Unit (GPU) [35] design, which achieves high multi-thread throughput with many cores, each of which is a moderate-frequency, multi-threaded, in-order processor that shares its control unit and instruction cache with seven other cores.
For control intensive code, the E7450 design can easily outperform the NVIDIA Tesla. For massively data parallel applications, the NVIDIA Tesla design can easily achieve higher performance than the E7450.

Data parallel code has the property that multiple instances of the code can be executed concurrently on different data. Data parallelism exists in many applications such as physics simulation, weather prediction, financial analysis, medical imaging, and media processing. Most of these applications also have control-intensive phases that are often interleaved between data-parallel phases. Hence, general purpose CPUs and accelerators can be combined to form heterogeneous parallel computing systems that efficiently execute all application phases [39]. There are many examples of successful heterogeneous systems. For example, the RoadRunner supercomputer couples AMD Opteron processors with IBM PowerXCell accelerators. If the RoadRunner supercomputer were benchmarked using only its general purpose CPUs, rather than being the top-ranked system in the Top500 List (June 2008), it would drop to a 50th-fastest ranking [6].

Current commercial programming models of processor-accelerator data transfer are based on Direct Memory Access (DMA) hardware, which is typically exposed to application programmers through memory copy routines. For example, in the CUDA programming model [38], an application programmer can transfer data from the processor to the accelerator device by calling a memory copy routine whose input parameters include a source data pointer into the processor memory space, a destination pointer into the accelerator memory space, and the number of bytes to be copied. The memory copy interface ensures that the accelerator can only access the part of the application data that is explicitly requested by the memory copy parameters.
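For concreteness, such an explicit transfer looks as follows in CUDA (a minimal sketch; the wrapper function and buffer names around the cudaMemcpy() call are illustrative):

```c
#include <cuda_runtime.h>

/* Explicit processor-to-accelerator transfer: a destination pointer in
 * accelerator memory, a source pointer in system memory, and a byte
 * count. The accelerator only sees the bytes named by these parameters. */
void to_accelerator(float *dev_buf, const float *host_buf, size_t bytes)
{
    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
}
```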
In this paper we argue that heterogeneous systems benefit from a data-centric programming model where programmers assign data objects to performance critical methods. This model provides the run-time system with enough information to automatically transfer data between general purpose CPUs and accelerators. Such a run-time system improves the programmability and compatibility of heterogeneous systems. This paper introduces the Asymmetric Distributed Shared Memory (ADSM) model, a data-centric programming model that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. This asymmetry allows all coherence and consistency actions to be executed on the CPU, allowing the use of simple accelerators. This paper also presents GMAC, a user-level ADSM library, and discusses the design and implementation details of such a system. Experimental results using GMAC show that an ADSM system makes heterogeneous systems easier to program without introducing performance penalties.

The main contributions of this paper are: (1) the introduction of ADSM as a data-centric programming model for heterogeneous systems; the benefits of this model are architecture independence, legacy support, and efficient I/O support; (2) a detailed discussion of the design of an ADSM system, which includes the definition of the necessary API calls and a description of the memory coherence and consistency required by an ADSM system; (3) a description of the software techniques required to build an ADSM system for current accelerators on top of existing operating systems; (4) an analysis of different coherence protocols that can be implemented in an ADSM system.

This paper is organized as follows. Section 2 presents the necessary background and motivates this work. ADSM is presented as a data-centric programming model in Section 3, which also discusses the benefits of ADSM for heterogeneous systems and presents the API, consistency model, and different coherence protocols for ADSM systems. The design and implementation of an ADSM system, GMAC, is presented in Section 4. Section 5 presents experimental results. ADSM is compared to other work in Section 6. Finally, Section 7 concludes this paper.

2. Background and Motivation

2.1 Background

General purpose CPUs and accelerators can be coupled in many different ways. Fine-grained accelerators are usually attached as functional units inside the processor pipeline [21, 22, 41, 43]. The Xilinx Virtex 5 FXT FPGAs include a PowerPC 440 connected to reconfigurable logic by a crossbar [46]. In the Cell BE chip, the Synergistic Processing Units, the L2 cache controller, the memory interface controller, and the bus interface controller are connected through an Element Interconnect Bus [30]. The Intel Graphics Media Accelerator is integrated inside the Graphics and Memory Controller Hub, which manages the flow of information between the processor, the system memory interface, the graphics interface, and the I/O controller [27]. AMD Fusion chips will integrate the CPU, memory controller, GPU, and PCIe controller into a single chip. A common characteristic among the Virtex 5, Cell BE, Graphics Media Accelerator, and AMD Fusion is that general purpose CPUs and accelerators share access to system memory. In these systems, the system memory controller deals with memory requests coming from both general purpose CPUs and accelerators.

Accelerators and general purpose CPUs impose very different requirements on the system memory controller. General purpose CPUs are designed to minimize instruction latency and typically implement some form of strong memory consistency (e.g., sequential consistency in MIPS processors). Accelerators are designed to maximize data throughput and implement weak forms of memory consistency (e.g., Rigel implements weak consistency [32]).
Memory controllers for general purpose CPUs tend to implement narrow memory buses (e.g., 192 bits for the Intel Core i7) compared to data parallel accelerators (e.g., 512 bits for the NVIDIA GTX280) to minimize memory access time. Relaxed consistency models implemented by accelerators allow memory controllers to serve several requests in a single memory access. Strong consistency models required by general purpose CPUs do not offer the same freedom to rearrange accesses to system memory. Memory access scheduling in the memory controller thus has different requirements for general purpose CPUs and accelerators (i.e., latency vs. throughput). Virtual memory management also tends to be quite different on CPUs and accelerators (e.g., GPUs tend to benefit more from large page sizes than CPUs), which makes the design of their TLBs and MMUs quite different (e.g., incompatible memory page sizes). Hence, general purpose CPUs and accelerators are connected to separate memories in most heterogeneous systems, as shown in Figure 1.

Figure 1. Reference Architecture, similar to desktop GPUs and RoadRunner blades

Many such examples of heterogeneous systems currently exist. The NVIDIA GeForce graphics card [35] includes its own GDDR memory (up to 4GB) and is attached to the CPU through a PCIe bus. Future graphics cards based on the Intel Larrabee [40] chip will have a similar configuration. The Roadrunner supercomputer is composed of nodes that include two AMD Opteron CPUs (IBM BladeCenter LS21) and four PowerXCell chips (2x IBM BladeCenter QS22). Each LS21 BladeCenter is connected to two QS22 BladeCenters through a PCIe bus, constraining processors to access only on-board memory [6]. In this paper we assume the base heterogeneous system shown in Figure 1. However, the concepts developed in this paper are equally applicable to systems where general purpose CPUs and accelerators share the same physical memory.

2.2 Motivation

Heterogeneous parallel computing improves application performance by executing computationally intensive data-parallel kernels on accelerators designed to maximize data throughput, while executing the control-intensive code on general purpose CPUs. Hence, some data structures are likely to be accessed primarily by the code executed by accelerators. For instance, execution traces show that about 99% of read and write accesses to the main data structures in the NASA Parallel Benchmarks (NPB) occur inside computationally intensive kernels that are amenable to parallelization.

Figure 2 shows our estimate of the average memory bandwidth requirements of the computationally intensive kernels of some NPB benchmarks, assuming an 800MHz clock frequency for different values of IPC, and illustrates the need to store the data structures required by accelerators in their own memories. For instance, if all data accesses are done through a PCIe bus¹, the maximum achievable value of IPC is 50 for bt and 5 for ua, which is a very small fraction of the peak execution rate of the NVIDIA GTX295 GPU. In all cases, the level of IPC that can be supported by the GTX295 memory bandwidth is much higher than that supported by PCIe or similar interconnect schemes. In order to achieve optimal system performance, it is crucial to host the data structures accessed by computationally intensive kernels in on-board accelerator memories.

Figure 2. Estimated bandwidth requirements for computationally intensive kernels of the bt, ep, lu, mg, and ua benchmarks, assuming an 800MHz clock frequency

¹ If both the accelerator and CPU share the same memory controller, the available accelerator bandwidth will be similar to HyperTransport in Figure 2, which also limits the maximum achievable value of IPC.
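The IPC limits quoted above follow from a simple bandwidth balance: a kernel that moves b bytes across the interconnect per instruction, on an accelerator clocked at f Hz, can sustain at most IPC = bandwidth / (f * b). The sketch below reproduces the quoted limits of 50 and 5 under assumed numbers (an 8 GB/s PCIe figure and illustrative per-benchmark bytes-per-instruction intensities); the measured intensities behind Figure 2 are not given in this section.

```c
#include <stdio.h>

/* Back-of-the-envelope version of the bound Figure 2 plots:
 * sustainable IPC <= bus_bandwidth / (clock * bytes_per_instruction).
 * Bandwidth and intensity values are assumptions for illustration. */
int main(void)
{
    const double clock_hz = 800e6;   /* 800MHz accelerator clock */
    const double pcie_bw  = 8e9;     /* assumed ~8 GB/s PCIe bandwidth */
    const char  *bench[]  = { "bt-like", "ua-like" };
    const double bytes_per_instr[] = { 0.2, 2.0 };  /* assumed intensities */

    for (int i = 0; i < 2; i++) {
        double max_ipc = pcie_bw / (clock_hz * bytes_per_instr[i]);
        printf("%s: max sustainable IPC over PCIe = %.0f\n",
               bench[i], max_ipc);   /* prints 50 and 5 */
    }
    return 0;
}
```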
These results discourage the implementation of fully-coherent heterogeneous systems due to the high number of coherence requests produced by accelerators during kernel execution (e.g., invalidation requests the first time data initialized by the CPU is accessed by accelerators). Moreover, a fully coherent heterogeneous system would require both CPUs and accelerators to implement the very same coherence protocol. Hence, it would be difficult, if not infeasible, to use the same accelerator (e.g., a GPU) in systems based on different CPU architectures, which would impose a significant economic penalty on accelerator manufacturers. Finally, the logic required to implement the coherence protocol in the accelerator would consume a large silicon area, currently devoted to processing units, which would decrease the benefit of using accelerators.

The capacity of on-board accelerator memories is growing and currently allows many data structures to be hosted in accelerator memories. Current GPUs use 32-bit physical memory addresses and include up to 4GB of memory, and GPU architectures will soon move to larger physical addresses (e.g., 40-bit in NVIDIA Fermi) to support larger memory capacities. IBM QS20 and QS21, the first systems based on the Cell BE, included 512MB and 1GB of main memory per chip, respectively. IBM QS22, the latest Cell-based system, supports up to 16GB of main memory per chip. IBM QS22 is based on the PowerXCell 8i chip, an evolution of the original Cell BE chip modified to support a larger main memory capacity [6]. These two examples illustrate the current trend that allows increasingly larger data structures to be hosted by accelerators and justifies our ADSM design.

Programming models for current heterogeneous parallel systems, such as NVIDIA CUDA [11] and OpenCL [1], present the different memories in the system as distinct memory spaces to the programmer. Applications explicitly request memory from a given memory space (i.e., cudaMalloc()) and perform data transfers between different memory spaces (i.e., cudaMemcpy()). The example in Figure 3 illustrates this situation. First, system memory is allocated (malloc()) and initialized (fread()). Then, accelerator memory is allocated (cudaMalloc()) and the data structure is copied to the accelerator memory (cudaMemcpy()) before code is executed on the accelerator. OpenCL and Roadrunner codes do not significantly differ from the code in Figure 3. Such programming models ensure that data structures reside in the memory of the processor (CPU or accelerator) that performs the subsequent computations.

Figure 3. Example code with duplicated pointers and explicit consistency management
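The code in Figure 3 did not survive extraction in this copy, but the calls named in the text (malloc(), fread(), cudaMalloc(), cudaMemcpy(), a kernel launch, cudaFree()) imply a structure along the following lines; the kernel declaration, launch configuration, and file handling are illustrative reconstructions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void kernel(float *data, size_t size);  /* illustrative kernel */

void compute(FILE *file, size_t size, dim3 Dg, dim3 Db)
{
    float *foo, *dev_foo;                 /* duplicated pointers */

    foo = (float *)malloc(size);          /* allocate system memory */
    fread(foo, size, 1, file);            /* initialize on the CPU */

    cudaMalloc((void **)&dev_foo, size);  /* allocate accelerator memory */
    /* explicit consistency management: copy before kernel execution */
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);

    kernel<<<Dg, Db>>>(dev_foo, size);    /* execute on the accelerator */

    cudaFree(dev_foo);
    free(foo);
}
```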
These models also imply that programmers must explicitly request memory on different processors and, thus, a data structure (foo in Figure 3) is referenced by two different memory addresses: foo, a virtual address in system memory, and dev_foo, a physical address in the accelerator memory. Programmers must explicitly manage memory coherence (e.g., with a call to cudaMemcpy()) before executing kernels on the accelerator. This approach also prevents parameters from being passed by reference to accelerator kernels [19], and it prevents computationally critical methods from returning pointers to their output data structures instead of returning the whole output data structure, a capability that would save bandwidth whenever the CPU code only needs to access a small portion of the returned data. These approaches harm portability because they expose data transfer details of the underlying hardware. Offering a programming interface that requires a single allocation call and removes the need for explicit data transfers would increase the programmability and portability of heterogeneous systems; this is the first motivation of this paper.

The cost of data transfers between general purpose CPUs and accelerators might eliminate the benefit of using accelerators. Double buffering can help alleviate this situation by transferring parts of the data structure while other parts are still in use. In the example of Figure 3, the input data would be read iteratively using a call to fread() followed by an asynchronous DMA transfer to the accelerator memory. Synchronization code is necessary to prevent overwriting system memory that is still in use by an ongoing DMA transfer [19]. The coding effort required to reduce the cost of data transfers harms the programmability of heterogeneous systems. Automatically overlapping data transfers and CPU computation without code modifications is the second motivation of this paper.

3. Asymmetric Distributed Shared Memory

Asymmetric Distributed Shared Memory (ADSM) maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. This section presents ADSM as a data-centric programming model and the benefits of an asymmetric shared address space.

3.1 ADSM as a Data-Centric Programming Model

In a data-centric programming model, programmers allocate or declare data objects that are processed by methods, and annotate performance critical methods (kernels) that are executed by accelerators. When such methods are assigned to an accelerator, their corresponding data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory.
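As an illustration of this data-centric style, the Figure 3 example can be collapsed to a single allocation and a single pointer. The sketch below borrows the gmacMalloc()/gmacFree() naming of the GMAC library presented later in this paper; the exact signatures, the hypothetical header, and the kernel declaration are illustrative assumptions rather than the definitive API:

```c
#include <stdio.h>
#include <gmac.h>   /* hypothetical header for the GMAC user-level library */

__global__ void kernel(float *data, size_t size);  /* illustrative kernel */

void compute(FILE *file, size_t size, dim3 Dg, dim3 Db)
{
    float *foo;

    gmacMalloc((void **)&foo, size);  /* one allocation in the shared space */
    fread(foo, size, 1, file);        /* CPU code uses foo directly */

    /* The same pointer is valid on the accelerator; the ADSM run-time
     * performs any required data transfers transparently. */
    kernel<<<Dg, Db>>>(foo, size);

    gmacFree(foo);
}
```

Compared with Figure 3, the duplicated pointers and the explicit cudaMemcpy() consistency management disappear, and the run-time decides when data must actually move.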