void compute(FILE *file, int size)
{
    float *foo, *dev_foo;
    foo = malloc(size);                    /* duplicated pointers: foo ...      */
    fread(foo, size, 1, file);
    cudaMalloc((void **)&dev_foo, size);   /* ... and dev_foo name the same data */
    /* explicit consistency management */
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
    kernel<<<Dg, Db>>>(dev_foo, size);
    cudaFree(dev_foo);
    free(foo);
}

Figure 3. Example code with duplicated pointers and explicit consistency management
[Figure 2: IPC (0 to 100) versus bandwidth (100 MBps to 100 GBps); reference lines for PCIe, QPI, HyperTransport, and the NVIDIA GTX295 memory bandwidth; series for the bt, ep, lu, mg, and ua benchmarks.]

Figure 2. Estimated bandwidth requirements for computationally intensive kernels of the bt, ep, lu, mg, and ua benchmarks, assuming an 800MHz clock frequency

is a very small fraction of the peak execution rate of the NVIDIA GTX295 GPU. In all cases, the level of IPC that can be supported by the GTX295 memory bandwidth is much higher than that supported by PCIe or similar interconnect schemes. To achieve optimal system performance, it is crucial to host the data structures accessed by computationally intensive kernels in on-board accelerator memories.
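A back-of-the-envelope calculation makes this gap concrete. If each instruction moves a fixed number of bytes on average, the IPC a link can sustain is roughly its bandwidth divided by that per-instruction traffic and the clock frequency. The sketch below applies this estimate; the 800MHz clock matches Figure 2, while the link bandwidths (PCIe 2.0 x16 at roughly 8 GB/s, GTX295 GDDR3 at roughly 112 GB/s per GPU) and the four-bytes-per-instruction figure are illustrative assumptions, not values taken from the text.

#include <stdio.h>

/* Sustainable IPC ~= bandwidth / (bytes per instruction * clock frequency).
 * All constants except the 800MHz clock are illustrative assumptions. */
int main(void)
{
    const double clock_hz        = 800e6;  /* accelerator clock, as in Figure 2 */
    const double bytes_per_instr = 4.0;    /* assumed: one 4-byte operand/instr */
    const double pcie_bw         = 8e9;    /* PCIe 2.0 x16, ~8 GB/s             */
    const double gddr3_bw        = 112e9;  /* GTX295, ~112 GB/s per GPU         */

    printf("IPC sustainable over PCIe:       %.1f\n",
           pcie_bw / (bytes_per_instr * clock_hz));
    printf("IPC sustainable from GPU memory: %.1f\n",
           gddr3_bw / (bytes_per_instr * clock_hz));
    return 0;
}

Under these assumptions the PCIe link sustains an IPC of about 2.5, versus about 35 from the on-board memory, an order-of-magnitude gap consistent with the argument above.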
These results discourage the implementation of a fully-coherent heterogeneous system, due to the high number of coherence requests produced by accelerators during kernel execution (e.g., invalidation requests the first time data initialized by the CPU is accessed by accelerators). Moreover, a fully coherent heterogeneous system would require both CPUs and accelerators to implement the very same coherence protocol. Hence, it would be difficult, if not infeasible, to use the same accelerator (e.g., a GPU) in systems based on different CPU architectures, which would impose a significant economic penalty on accelerator manufacturers. Finally, the logic required to implement the coherence protocol in the accelerator would consume a large silicon area currently devoted to processing units, which would decrease the benefit of using accelerators.

The capacity of on-board accelerator memories is growing and currently allows many data structures to be hosted in accelerator memories. Current GPUs use 32-bit physical memory addresses and include up to 4GB of memory; GPU architectures will soon move to larger physical addresses (e.g., 40-bit in NVIDIA Fermi) to support larger memory capacities. IBM QS20 and QS21, the first systems based on the Cell BE, included 512MB and 1GB of main memory per chip, respectively. IBM QS22, the latest Cell-based system, supports up to 16GB of main memory per chip. IBM QS22 is based on the PowerXCell 8i chip, an evolution of the original Cell BE chip modified to support a larger main memory capacity [6]. These two examples illustrate the current trend that allows increasingly larger data structures to be hosted by accelerators and justifies our ADSM design.

Programming models for current heterogeneous parallel systems, such as NVIDIA CUDA [11] and OpenCL [1], present the different memories in the system as distinct memory spaces to the programmer. Applications explicitly request memory from a given memory space (e.g., cudaMalloc()) and perform data transfers between different memory spaces (e.g., cudaMemcpy()). The example in Figure 3 illustrates this situation. First, system memory is allocated (malloc()) and initialized (fread()). Then, accelerator memory is allocated (cudaMalloc()) and the data structure is copied to the accelerator memory (cudaMemcpy()) before code is executed on the accelerator. OpenCL and Roadrunner codes do not significantly differ from the code in Figure 3.

Such programming models ensure that data structures reside in the memory of the processor (CPU or accelerator) that performs subsequent computations. These models also imply that programmers must explicitly request memory on different processors; thus, a data structure (foo in Figure 3) is referenced by two different memory addresses: foo, a virtual address in system memory, and dev_foo, a physical address in the accelerator memory. Programmers must explicitly manage memory coherence (e.g., with a call to cudaMemcpy()) before executing kernels on the accelerator. This approach also prevents parameters from being passed by reference to accelerator kernels [19], and prevents computationally critical methods from returning pointers to their output data structures instead of returning the whole output data structure; returning a pointer would save bandwidth whenever the CPU code only accesses a small portion of the returned data.

These approaches harm portability because they expose data transfer details of the underlying hardware. Offering a programming interface that requires a single allocation call and removes the need for explicit data transfers would increase the programmability and portability of heterogeneous systems; this is the first motivation of this paper.

The cost of data transfers between general-purpose CPUs and accelerators might eliminate the benefit of using accelerators. Double buffering can help alleviate this situation by transferring parts of the data structure while other parts are still in use. In the example of Figure 3, the input data would be read iteratively, each call to fread() being followed by an asynchronous DMA transfer to the accelerator memory. Synchronization code is necessary to prevent overwriting system memory that is still in use by an ongoing DMA transfer [19]. This coding effort to reduce the cost of data transfers harms the programmability of heterogeneous systems. Automatically overlapping data transfers and CPU computation without code modifications is the second motivation of this paper.
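As an illustration of the coding effort just described, the sketch below shows how the fread() loop of Figure 3 might be overlapped with DMA transfers in CUDA. It is a minimal sketch under assumed parameters (a 1MB chunk size, two staging buffers, a single stream), not code from the paper. cudaMemcpyAsync() requires page-locked host memory, and the per-buffer events provide the synchronization that prevents fread() from overwriting a buffer whose transfer is still in flight.

#include <stdio.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   /* assumed transfer granularity: 1MB */

void read_and_copy(FILE *file, void *dev_foo, size_t size)
{
    char *buf[2];               /* two pinned staging buffers */
    cudaEvent_t copied[2];      /* signals that a buffer's DMA has completed */
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    for (int i = 0; i < 2; i++) {
        cudaMallocHost((void **)&buf[i], CHUNK);  /* page-locked memory is   */
        cudaEventCreate(&copied[i]);              /* required for async DMA  */
    }

    int cur = 0;
    size_t iter = 0;
    for (size_t off = 0; off < size; off += CHUNK, cur ^= 1, iter++) {
        size_t bytes = (size - off < CHUNK) ? (size - off) : CHUNK;
        /* Do not overwrite a buffer whose previous transfer may still be
         * in flight: wait for the event recorded two iterations ago. */
        if (iter >= 2)
            cudaEventSynchronize(copied[cur]);
        fread(buf[cur], 1, bytes, file);          /* CPU reads next chunk ...  */
        cudaMemcpyAsync((char *)dev_foo + off, buf[cur], bytes,
                        cudaMemcpyHostToDevice, stream); /* ... while the     */
        cudaEventRecord(copied[cur], stream);            /* previous one moves */
    }
    cudaStreamSynchronize(stream);   /* wait for the last transfer */

    for (int i = 0; i < 2; i++) {
        cudaFreeHost(buf[i]);
        cudaEventDestroy(copied[i]);
    }
    cudaStreamDestroy(stream);
}

Note how much machinery (pinned buffers, events, per-buffer synchronization) is needed for a single input array; this is the programmability cost the second motivation refers to.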
3. Asymmetric Distributed Shared Memory

Asymmetric Distributed Shared Memory (ADSM) maintains a shared logical memory space in which CPUs can access objects in the accelerator physical memory, but not vice versa. This section presents ADSM as a data-centric programming model and discusses the benefits of an asymmetric shared address space.

3.1 ADSM as a Data-Centric Programming Model

In a data-centric programming model, programmers allocate or declare data objects that are processed by methods, and annotate performance-critical methods (kernels) that are executed by accelerators. When such methods are assigned to an accelerator, their corresponding data objects are migrated to that accelerator's memory.
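To make the contrast with Figure 3 concrete, the sketch below shows how the same routine might look under an ADSM-style interface. The adsmAlloc() and adsmFree() calls are hypothetical names introduced here only for illustration (this section only establishes the model, not an API); kernel, Dg, and Db are reused from Figure 3. A single allocation returns one pointer that is valid on both the CPU and the accelerator, and the runtime manages coherence.

void compute(FILE *file, int size)
{
    /* One allocation, one pointer: adsmAlloc() is a hypothetical call that
     * returns an address valid on both the CPU and the accelerator. */
    float *foo = adsmAlloc(size);
    fread(foo, size, 1, file);      /* CPU initializes the object directly  */
    kernel<<<Dg, Db>>>(foo, size);  /* same pointer passed to the kernel;   */
                                    /* the runtime keeps the data coherent  */
    adsmFree(foo);                  /* hypothetical counterpart of adsmAlloc() */
}

Compared with Figure 3, the duplicated pointer and the explicit cudaMemcpy() call disappear, which is precisely the programmability gain the first motivation argues for.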