run-time system improves programmability and compatibility of heterogeneous systems. This paper introduces the Asymmetric Distributed Shared Memory (ADSM) model, a data-centric programming model that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. This asymmetry allows all coherence and consistency actions to be executed on the CPU, allowing the use of simple accelerators. This paper also presents GMAC, a user-level ADSM library, and discusses design and implementation details of such a system. Experimental results using GMAC show that an ADSM system makes heterogeneous systems easier to program without introducing performance penalties.
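To make the asymmetry concrete, the following sketch illustrates the usage pattern an ADSM run-time enables: the CPU allocates an object that logically resides in accelerator memory, reads and writes it through an ordinary pointer, and never issues explicit copy calls. The adsm-prefixed names are hypothetical placeholders rather than the GMAC API presented later, and the stub definitions simply use host memory so that the example compiles and runs on a plain CPU.

/* Minimal ADSM-style usage sketch. The adsm* names are hypothetical
 * placeholders, not the GMAC API; the stubs below use host memory so
 * the example compiles, whereas a real ADSM run-time would place the
 * object in accelerator memory and keep the CPU view coherent. */
#include <stdio.h>
#include <stdlib.h>

static void *adsmAlloc(size_t bytes) { return malloc(bytes); }
static void  adsmFree(void *p)       { free(p); }
static void  adsmSync(void)          { /* wait for the accelerator */ }

/* Stand-in for a data-parallel kernel that would run on the accelerator. */
static void scale_kernel(float *v, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}

int main(void)
{
    size_t n = 1u << 20;
    float *v = adsmAlloc(n * sizeof *v);   /* one shared logical object        */

    for (size_t i = 0; i < n; i++)         /* CPU initializes the object       */
        v[i] = (float)i;                   /* directly; no copy-to-device call */

    scale_kernel(v, n, 2.0f);              /* conceptually launched on the     */
    adsmSync();                            /* accelerator, then synchronized   */

    printf("v[10] = %f\n", v[10]);         /* CPU reads the result in place    */
    adsmFree(v);
    return 0;
}

Because only the CPU side ever triggers coherence and consistency actions, the kernel code needs no knowledge of the shared memory space, which is what keeps the accelerator simple.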
The main contributions of this paper are: (1) the introduction of ADSM as a data-centric programming model for heterogeneous systems. The benefits of this model are architecture independence, legacy support, and efficient I/O support; (2) a detailed discussion about the design of an ADSM system, which includes the definition of the necessary API calls and the description of memory coherence and consistency required by an ADSM system; (3) a description of the software techniques required to build an ADSM system for current accelerators on top of existing operating systems; (4) an analysis of different coherence protocols that can be implemented in an ADSM system.

This paper is organized as follows. Section 2 presents the necessary background and motivates this work. ADSM is presented as a data-centric programming model in Section 3, which also discusses the benefits of ADSM for heterogeneous systems and presents the API, consistency model, and different coherence protocols for ADSM systems. The design and implementation of an ADSM system, GMAC, is presented in Section 4. Section 5 presents experimental results. ADSM is compared to other work in Section 6. Finally, Section 7 concludes this paper.

2. Background and Motivation

2.1 Background

General purpose CPUs and accelerators can be coupled in many different ways. Fine-grained accelerators are usually attached as functional units inside the processor pipeline [21, 22, 41, 43]. The Xilinx Virtex 5 FXT FPGAs include a PowerPC 440 connected to reconfigurable logic by a crossbar [46]. In the Cell BE chip, the Synergistic Processing Units, L2 cache controller, memory interface controller, and bus interface controller are connected through an Element Interconnect Bus [30]. The Intel Graphics Media Accelerator is integrated inside the Graphics and Memory Controller Hub that manages the flow of information between the processor, the system memory interface, the graphics interface, and the I/O controller [27]. AMD Fusion chips will integrate CPU, memory controller, GPU, and PCIe controller into a single chip. A common characteristic among Virtex 5, Cell BE, Graphics Media Accelerator, and AMD Fusion is that general purpose CPUs and accelerators share access to system memory. In these systems, the system memory controller deals with memory requests coming from both general purpose CPUs and accelerators.

Accelerators and general purpose CPUs impose very different requirements on the system memory controller. General purpose CPUs are designed to minimize instruction latency and typically implement some form of strong memory consistency (e.g., sequential consistency in MIPS processors). Accelerators are designed to maximize data throughput and implement weak forms of memory consistency (e.g., Rigel implements weak consistency [32]). Memory controllers for general purpose CPUs tend to implement narrow memory buses (e.g., 192 bits for the Intel Core i7) compared to data-parallel accelerators (e.g., 512 bits for the NVIDIA GTX280) to minimize the memory access time.

Figure 1. Reference Architecture, similar to desktop GPUs and RoadRunner blades
Relaxed consistency models implemented by accelerators allow memory controllers to serve several requests in a single memory access. Strong consistency models required by general purpose CPUs do not offer the same freedom to rearrange accesses to system memory. Memory access scheduling in the memory controller therefore has different requirements for general purpose CPUs and accelerators (i.e., latency vs. throughput). Virtual memory management also tends to be quite different on CPUs and accelerators (e.g., GPUs tend to benefit more from large page sizes than CPUs), which makes the design of TLBs and MMUs quite different (e.g., incompatible memory page sizes). Hence, general purpose CPUs and accelerators are connected to separate memories in most heterogeneous systems, as shown in Figure 1. Many such examples of heterogeneous systems currently exist. The NVIDIA GeForce graphics card [35] includes its own GDDR memory (up to 4GB) and is attached to the CPU through a PCIe bus. Future graphics cards based on the Intel Larrabee [40] chip will have a similar configuration. The Roadrunner supercomputer is composed of nodes that include two AMD Opteron CPUs (IBM BladeCenter LS21) and four PowerXCell chips (2x IBM BladeCenter QS22). Each LS21 BladeCenter is connected to two QS22 BladeCenters through a PCIe bus, constraining processors to access only on-board memory [6]. In this paper we assume the base heterogeneous system shown in Figure 1. However, the concepts developed in this paper are equally applicable to systems where general purpose CPUs and accelerators share the same physical memory.

2.2 Motivation

Heterogeneous parallel computing improves application performance by executing computationally intensive data-parallel kernels on accelerators designed to maximize data throughput, while executing the control-intensive code on general purpose CPUs. Hence, some data structures are likely to be accessed primarily by the code executed by accelerators. For instance, execution traces show that about 99% of read and write accesses to the main data structures in the NASA Parallel Benchmarks (NPB) occur inside computationally intensive kernels that are amenable to parallelization.

Figure 2 shows our estimate of the average memory bandwidth requirements of the computationally intensive kernels of some NPB benchmarks for different values of IPC, assuming an 800MHz clock frequency, and illustrates the need to store the data structures required by accelerators in their own memories. For instance, if all data accesses are done through a PCIe bus¹, the maximum achievable value of IPC is 50 for bt and 5 for ua, which

¹ If both accelerator and CPU share the same memory controller, the available accelerator bandwidth will be similar to HyperTransport in Figure 2, which also limits the maximum achievable value of IPC.
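The IPC ceilings quoted above follow from a simple bandwidth budget: if a kernel moves B bytes of off-chip data per executed instruction on average, an interconnect sustaining BW bytes per second caps the instruction rate at BW/B, and dividing by the clock frequency gives the maximum IPC. The sketch below restates this arithmetic; the bandwidth and bytes-per-instruction values are illustrative assumptions chosen to reproduce the quoted ceilings, not figures taken from the paper's NPB traces.

/* Back-of-the-envelope IPC ceiling imposed by interconnect bandwidth.
 * All numeric inputs are illustrative assumptions, not measured values. */
#include <stdio.h>

static double max_ipc(double bw_gb_per_s, double clock_hz, double bytes_per_instr)
{
    double instr_per_s = (bw_gb_per_s * 1e9) / bytes_per_instr; /* sustainable rate   */
    return instr_per_s / clock_hz;                              /* instructions/cycle */
}

int main(void)
{
    double clock = 800e6;   /* 800 MHz accelerator clock (from the text)       */
    double pcie  = 8.0;     /* ~8 GB/s PCIe link (assumed)                     */
    double gddr  = 140.0;   /* ~140 GB/s on-board accelerator memory (assumed) */

    /* Assumed traffic: 0.2 B/instr for a compute-heavy kernel, 2 B/instr
     * for a memory-heavy one. */
    printf("PCIe, 0.2 B/instr: max IPC = %.1f\n", max_ipc(pcie, clock, 0.2)); /* 50.0 */
    printf("PCIe, 2.0 B/instr: max IPC = %.1f\n", max_ipc(pcie, clock, 2.0)); /*  5.0 */
    printf("GDDR, 2.0 B/instr: max IPC = %.1f\n", max_ipc(gddr, clock, 2.0)); /* 87.5 */
    return 0;
}

Keeping the data in on-board accelerator memory raises the bandwidth term by more than an order of magnitude, which is the argument Figure 2 makes for storing accelerator-resident data structures locally.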