As illustrated in Fig. 4, the local block-wide scans are typically implemented using a parallel reduce-then-scan strategy. This allows the thread blocks to perform local upsweep and downsweep work in parallel, each core only having to wait at its phase-transition for the required prefix to be produced by its predecessor. The running prefix is then aggregated into the block-wide downsweep.

Chained-scan's performance is limited by the latency of signal propagation between thread blocks. For example, suppose an NVIDIA Tesla C2050 GPU processor with 144 GB/sec memory bandwidth, processor cores operating at 1.15 GHz, and a latency of 600 clock cycles to pass a message from one core to another through the memory hierarchy. This signaling latency limits the system to a scan throughput of 1.9M partitions/sec. If each core is assigned a partition of 256 inputs (32 bits each), the maximum-achievable scan throughput will be limited to 490M inputs/sec. This is well short of the theoretical memory bandwidth limit of 18B inputs/sec for simply reading and writing the data.

When limited by partition-signaling, one solution is to increase the partition size. (In the previous example, a partition size of roughly 9,000 inputs per thread block would be required to saturate memory bandwidth.) However, it may be impractical to increase the partition size arbitrarily. Many processor designs have limited storage capacity per core.
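The throughput figures above follow from a few lines of arithmetic. The following sketch reproduces them from the stated C2050 parameters (the 600-cycle latency, clock rate, and bandwidth are taken directly from the example; the rounding to "490M" and "~9,000" matches the prose):

```python
# Back-of-the-envelope check of the chained-scan throughput example.
clock_hz = 1.15e9          # core clock (1.15 GHz)
latency_cycles = 600       # core-to-core signaling latency (cycles)
partition_size = 256       # inputs assigned per partition
bytes_per_input = 4        # 32-bit inputs
bandwidth = 144e9          # memory bandwidth (bytes/sec)

# Serial signal propagation admits one partition per latency period.
partitions_per_sec = clock_hz / latency_cycles           # ~1.9M partitions/sec
scan_throughput = partitions_per_sec * partition_size    # ~490M inputs/sec

# Bandwidth ceiling: each input is read once and written once.
bandwidth_limit = bandwidth / (2 * bytes_per_input)      # 18B inputs/sec

# Partition size at which signaling keeps pace with memory bandwidth
# (comes out near 9,400, in the same ballpark as the ~9,000 quoted above).
saturating_partition = bandwidth_limit / partitions_per_sec

print(partitions_per_sec, scan_throughput, bandwidth_limit, saturating_partition)
```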
For example, GPU processors provide virtualized cores whose performance is optimized when the working set fits completely within on-chip register file and shared memory resources. Our method, however, is insensitive to partition size and is therefore amenable to diverse architectural configurations.

4. Decoupled Look-back

4.1 Operation

Our method is a generalization of the chained-scan approach with dramatically reduced prefix propagation latencies. The idea is to decouple the singular dependence of each processor on its immediate predecessor at the expense of progressively redundant computation. Whereas the chained-scan approach has a fixed "look-back" of one partition, our method allows processors to inspect the status of predecessors that are increasingly further away. This is illustrated in Fig. 5.

As each partition is processed, its status is updated with the partition-wide aggregate. Each aggregate is simply the reduction of the items within the partition, and can be computed independently from other partitions. Because the aggregates are readily available, each processor is generally free to peruse the aggregates recorded for preceding partitions, progressively accumulating them until the complete exclusive prefix is known (the running total across all prior partitions). The partition's status is then updated with the inclusive prefix, which is computed from the exclusive prefix and the partition-wide aggregate. The exclusive prefix is then used to begin the partition's downsweep scan phase. Furthermore, an opportunistic encounter with a predecessor's inclusive prefix permits early termination of the look-back phase.

More specifically, each parallel processor within our method operates as follows:

1. Initialize the partition descriptor. Each processor's partition is allocated a status descriptor having the following fields:

   - aggregate. Used to record the partition-wide aggregate as computed by the upsweep phase of the partition scan.

   - inclusive_prefix. Used to record the partition's inclusive prefix as computed by reducing aggregate with the accumulated look-back from preceding partitions.

   - status_flag. The flag describes the partition's current status as one of the following:

     A - aggregate available. Indicates the aggregate field has been recorded for the associated partition.

     P - prefix available. Indicates the inclusive_prefix field has been recorded for the associated partition.

     X - invalid. Indicates no information about the partition is available to other processors.

   All descriptors are initialized with status X.

2. Synchronize. All processors synchronize to ensure a consistent view of initialized partition descriptors.

3. Compute and record the partition-wide aggregate. Each processor computes and records its partition-wide aggregate to the corresponding partition descriptor. It then executes a memory fence and updates the descriptor's status_flag to A. Furthermore, the processor owning the first partition copies aggregate to the inclusive_prefix field, updates status_flag to P, and skips to Step 6 below.

4. Determine the partition's exclusive prefix using decoupled look-back. Each processor initializes and maintains a running exclusive_prefix as it progressively inspects the descriptors of increasingly antecedent partitions, beginning with the immediately preceding partition. For each predecessor, the processor conditionally performs the following based upon the predecessor's status_flag:

   X -- Block (or continue polling) until the status_flag is not X.

   A -- The predecessor's aggregate field is added to exclusive_prefix and the processor continues on to inspect the preceding partition.

   P -- The predecessor's inclusive_prefix field is added to exclusive_prefix and the look-back phase is terminated.

5. Compute and record the partition-wide inclusive prefix. Each processor adds exclusive_prefix to aggregate and records the result to the descriptor's inclusive_prefix field.
It then executes a memory fence and updates the descriptor's status_flag to P.

6. Perform a partition-wide scan seeded with the partition's exclusive prefix. Each processor completes the scan of its partition, incorporating exclusive_prefix into every output value.

The computation of each processor can proceed independently and in parallel with other processors throughout Steps 1, 3, 5, and 6. In Step 4 (decoupled look-back), each processor must wait on its predecessor(s) to finish Step 3 (record the partition-wide aggregate).

4.2 Properties

Our method has the following properties:

- Safety. The algorithm will run to completion if the system guarantees forward progress for all processors. Forward progress ensures that no processor will wait indefinitely in Step 4: every predecessor is free to record its aggregate in
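The six-step procedure above can be exercised with a small shared-memory simulation. The following sketch uses CPU threads as stand-ins for the paper's processors and a list of descriptor dicts as the partition descriptors; the names `decoupled_scan` and `worker` are illustrative, not from the paper, the busy-wait poll stands in for the X case, and the GIL's sequential interleaving stands in for the explicit memory fences of Steps 3 and 5:

```python
import threading

X, A, P = 0, 1, 2  # descriptor status flags: invalid, aggregate, prefix

def decoupled_scan(data, partition_size):
    """Inclusive prefix sum via the single-pass decoupled look-back protocol."""
    n_parts = (len(data) + partition_size - 1) // partition_size
    # Step 1: one descriptor per partition, initialized to status X.
    desc = [{"flag": X, "aggregate": 0, "inclusive_prefix": 0}
            for _ in range(n_parts)]
    out = [0] * len(data)
    # Step 2: all descriptors are initialized before any worker starts.

    def worker(p):
        lo, hi = p * partition_size, min((p + 1) * partition_size, len(data))
        part = data[lo:hi]
        # Step 3: compute and record the partition-wide aggregate (reduction).
        agg = sum(part)
        desc[p]["aggregate"] = agg
        if p == 0:
            # First partition: its aggregate IS its inclusive prefix; skip to 6.
            desc[p]["inclusive_prefix"] = agg
            desc[p]["flag"] = P
            exclusive = 0
        else:
            desc[p]["flag"] = A
            # Step 4: look back over increasingly antecedent partitions.
            exclusive, pred = 0, p - 1
            while True:
                flag = desc[pred]["flag"]
                if flag == X:
                    continue            # poll until the predecessor posts
                if flag == A:
                    exclusive += desc[pred]["aggregate"]
                    pred -= 1           # keep inspecting further back
                else:                   # P: a full prefix ends the look-back
                    exclusive += desc[pred]["inclusive_prefix"]
                    break
            # Step 5: publish this partition's inclusive prefix.
            desc[p]["inclusive_prefix"] = exclusive + agg
            desc[p]["flag"] = P
        # Step 6: partition-wide scan seeded with the exclusive prefix.
        running = exclusive
        for i, v in enumerate(part):
            running += v
            out[lo + i] = running

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n_parts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

For example, `decoupled_scan(list(range(1, 11)), 3)` yields the inclusive scan of 1..10. Note that each worker touches global memory for its inputs only once, which is the property that lets the real GPU kernel approach the read-plus-write bandwidth limit discussed earlier.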