Single-pass Parallel Prefix Scan with Decoupled Look-back

Duane Merrill, NVIDIA Corporation, dumerrill@nvidia.com
Michael Garland, NVIDIA Corporation, mgarland@nvidia.com

Abstract

We describe a work-efficient, communication-avoiding, single-pass method for the parallel computation of prefix scan. When consuming input from memory, our algorithm requires only ~2n data movement: n inputs are read, n outputs are written. Our method embodies a decoupled look-back strategy that performs redundant work to dissociate local computation from the latencies of global prefix propagation. Implemented by the CUB library of parallel primitives for GPU architectures, the performance throughput of our parallel prefix scan approaches that of copy operations. Furthermore, the single-pass nature of our method allows it to be adapted for (1) in-place compaction behavior, and (2) in-situ global allocation within computations that oversubscribe the processor.

1. Introduction

Parallel prefix scan is a fundamental parallel computing primitive. Given a list of input elements and a binary reduction operator, a prefix scan produces a corresponding output list where each output is computed to be the reduction of the elements occurring earlier in the input. A prefix sum connotes a prefix scan with the addition operator, i.e., each output number is the sum of the corresponding numbers occurring previously in the input list. An inclusive scan indicates that the i-th output reduction incorporates the i-th input element. An exclusive scan indicates the i-th input is not incorporated into the i-th output reduction. Applications of scan include adder design, linear recurrence and tridiagonal solvers, parallel allocation and queuing, list compaction and partitioning, segmented reduction, etc. For example, an exclusive prefix sum across a list of allocation requirements [8,6,7,5,3,0,9] produces a corresponding list of allocation offsets [0,8,14,21,26,29,29].

In this report, we describe the decoupled look-back method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives [21]. For highly parallel architectures, prefix sum is a scalable mechanism for cooperative allocation within dynamic and irregular data structures [4, 20]. Contemporary GPUs are at the leading edge of the current trend of increased parallelism in computer architecture, provisioning tens of thousands of data parallel threads. As such, prefix scan plays an important role in many GPU algorithms.

In modern computer systems, the performance and power consumption of prefix scan is typically bound by the cost of data movement: reading inputs and writing results to memory is generally more expensive than computing the reduction operations themselves. Therefore communication avoidance (minimizing last-level data movement) is a practical design objective for parallel prefix scan. The sequential prefix scan algorithm requires only a single pass through the data to accumulate and progressively output the running total. As such, it incurs the optimal 2n data movement: n reads and n writes.
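For reference, the sequential algorithm is a single loop that maintains a running total, reading each input once and writing each output once. The following minimal sketch (our own illustration; names are not from the CUB library) computes the exclusive prefix sum of the allocation example above:

```cuda
#include <cstdio>

// Sequential exclusive prefix sum: one pass over the data,
// exactly n reads and n writes.
void exclusive_scan(const int* in, int* out, int n) {
    int running_total = 0;
    for (int i = 0; i < n; ++i) {
        out[i] = running_total;   // output excludes the i-th input
        running_total += in[i];   // accumulate for subsequent outputs
    }
}

int main() {
    int in[7] = {8, 6, 7, 5, 3, 0, 9};   // allocation requirements
    int out[7];
    exclusive_scan(in, out, 7);
    for (int i = 0; i < 7; ++i) printf("%d ", out[i]);  // prints: 0 8 14 21 26 29 29
    printf("\n");
    return 0;
}
```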
Contemporary GPU scan parallelization strategies such as reduce-then-scan are typically memory-bound, but impose ~3n global data movement [2, 16, 22]. Furthermore, they perform two full passes over the input, which precludes them from serving as in-situ global allocation mechanisms within computations that oversubscribe the processor. Finally, these scan algorithms cannot be modified for in-place compaction behavior (selection, run-length encoding, duplicate removal, etc.) because the execution order of thread blocks within the output pass is unconstrained. Separate storage is required for the compacted output to prevent race conditions where inputs might otherwise be overwritten before they can be read.

Alternatively, the chained-scan GPU parallelization [11, 27] operates in a single pass, but is hindered by serial prefix dependences between adjacent processors that prevent memory I/O from fully saturating [27]. In comparison, our decoupled look-back algorithm elides these serial dependences at the expense of bounded redundant computation. As a result, our prefix scan computations (as well as adaptations for in-place compaction behavior and in-situ allocation) are typically capable of saturating memory bandwidth in a single pass.

2. Background

Parallel solutions to scan problems have been investigated for decades. In fact, the earliest research predates the discipline of Computer Science itself: scan circuits are fundamental to the operation of fast adder hardware (e.g., carry-skip, carry-select, and carry-lookahead adders) [8, 24]. As such, many commonplace scan parallelizations are presented as recursively-defined, acyclic dataflow networks in the circuit model [7, 26] of parallel computation. In this model, prefix scan can be thought of as a forest of reduction trees, one for each output. Network size is reduced when reduction trees share intermediate partial sums. For practical use in computer software, scan networks are typically encoded as imperative algorithms in the PRAM model [13, 14].

The minimum circuit depth and size for a parallel scan network are log₂n steps and n−1 operators, respectively. However, there are no known O(n) work-efficient networks having depth log₂n. Snir provides a lower bound regarding depth+size optimality (DSO) for parallel prefix networks: for a given network of size s gates and depth d levels, d + s ≥ 2n − 2 [25].
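As a concrete check of this bound (our own arithmetic, anticipating the serial network of Fig. 1a below): the chained-scan network has depth d = n − 1 and size s = n − 1, so

```latex
d + s = (n - 1) + (n - 1) = 2n - 2,
```

meeting Snir's bound with equality; a network can only be made shallower than this by growing its size.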
His research suggests that as circuit depth is increasingly constrained, DSO networks no longer exist and the size of the networks increases rapidly.

Fig. 1 presents commonplace scan networks relevant to contemporary prefix scan algorithms. Although the serial, or chained scan, construction in Fig. 1a has maximal n−1 depth and no concurrent computations, its minimal n−1 size makes it an attractive subcomponent of scan networks designed for oversubscribed processors. Increasing the computational granularity (i.e., items per thread) is a common technique for improving processor utilization by reducing inter-thread communication.

The Kogge-Stone construction [19] in Fig. 1b (and corresponding Hillis-Steele algorithm [17]) is a well-known, minimum-depth network that uses a recursive-doubling approach for aggregating partial reductions. Despite having inefficient O(n log₂n) work complexity, its shallow depth and simple shared memory address computations make it an attractive strategy for SIMD architectures (e.g., GPU warps) where inactive processor resources cannot be scavenged¹.

¹ Kogge-Stone "warpscans" are typical of GPU implementations where (1) SIMD-synchronicity has historically enabled efficient barrier-free communication, and (2) the hardware provisions a "shuffle" crossbar for efficient intra-warp communication.

The Sklansky construction [24] in Fig. 1c employs a recursive, scan-then-fan approach that also achieves minimum depth log₂n at the expense of O(n log₂n) work complexity. Compared to Kogge-Stone constructions, these networks exhibit high-radix fan-out when propagating partial prefixes computed by recursive subgraphs. This improved sharing leads to smaller circuit sizes and reduced memory bandwidth overheads.

Whereas minimum-depth circuits are fast when the input size is less than or equal to the width of the underlying multiprocessor, minimum-size networks are important for larger problems. Many parallel programming models virtualize the underlying physical processors, causing overall runtime to scale with circuit size instead of depth. Therefore work-efficiency is often a practical design objective for general-purpose prefix scan algorithms.

The Brent-Kung construction [8] in Fig. 1d (and corresponding Blelloch algorithm [4, 5]) is a work-efficient strategy having 2log₂n depth and O(n) size. Visually, the data flow resembles an "hourglass" shape comprising (1) an upsweep accumulation tree having progressively less parallelism, and (2) a downsweep propagation tree exhibiting progressively more parallelism. Generalizing the binary operators in the upsweep with radix-b scans and those in the downsweep with radix-b fans, the Brent-Kung strategy exhibits a more pronounced scan-then-propagate behavior (as illustrated in Fig. 2a).

For programming models that virtualize an unlimited number of processors, a concern with scan-then-propagate data flow is that ~n live values are spilled and filled through last-level memory between the upsweep and downsweep phases when the input exceeds on-chip memory. To eliminate the writes, we can simply rematerialize the intermediates during the downsweep phase at the expense of O(n) redundant calculations, as shown in Fig. 1e [9, 12, 22]. This has the effect of converting downsweep behavior from propagation to scan. We refer to this adaptation as the reduce-then-scan strategy.

In general, an important property of recursive network design is the ability to mix-and-match different strategies at different levels.
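To make the Kogge-Stone warpscan concrete, the following is a minimal, runnable sketch (our own illustration, not CUB's implementation) for 32-lane warps and integer addition, using the "shuffle" crossbar mentioned in the footnote:

```cuda
#include <cstdio>

// Kogge-Stone inclusive warpscan: log2(32) = 5 recursive-doubling steps.
__device__ int warp_inclusive_scan(int val) {
    int lane = threadIdx.x & 31;                      // lane id within the warp
    for (int offset = 1; offset < 32; offset <<= 1) {
        // Fetch the partial sum held by the lane `offset` positions below.
        int partial = __shfl_up_sync(0xffffffffu, val, offset);
        if (lane >= offset) val += partial;           // low lanes keep their value
    }
    return val;                                       // lane i now holds x0:xi
}

__global__ void demo(int* out) {
    out[threadIdx.x] = warp_inclusive_scan(1);        // scan of all ones
}

int main() {
    int* d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    demo<<<1, 32>>>(d_out);
    int h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i) printf("%d ", h_out[i]);  // prints: 1 2 ... 32
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```

Note the O(n log₂n) work: every lane participates in all five steps, whether or not its addition is predicated away.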
[Fig. 1. Commonplace scan constructions for n = 16. Dashed boxes illustrate recursive construction. (a) serial (chained scan); (b) Kogge-Stone; (c) Sklansky; (d) Brent-Kung; (e) reduce-then-scan.]

Further variation is also possible through operator generalization: whereas these binary operators compute radix-2 scans and fans, network height can be reduced using radix-b subcomponents as building blocks [18]. This flexibility allows for
hybrid design strategies that efficiently utilize the entire hierarchy of multiprocessor and memory organization.

3. Contemporary GPU scan strategies

Every GPU kernel is executed by a hierarchically-organized collection of threads. The individual threads executing the kernel function are grouped into thread blocks, and these blocks are grouped into kernel grids. The GPU uses thread blocks to efficiently manage the population of running threads on hardware multiprocessors (processor cores). Threads within the same block can cooperate through fast on-chip scratch memory and barrier synchronization.

Prefix scan implementations designed for GPUs are typically constructed from two levels of organization: (1) a global, coarse-grained strategy for computing a device-wide scan using radix-b sub-networks for {reduction | scan | propagation}; and (2) a set of local, fine-grained strategies for computing b-input {reduction | scan | propagation} within each thread block. We discuss the latter first.

3.1 Block-wide scan strategies

The blocking factor b is a function of thread block size and thread grain size, both of which are constrained by the physical resources of the multiprocessor and can be considered tunable constants. In practice, the tiles of block input are typically several thousand items each (e.g., b = 2048 items for 128 threads with 16 items per thread). Thus tile size b is uncorrelated to global input size n, and the asymptotic work-(in)efficiency of any particular block-wide scan construction will not affect that of the global strategy.

Because we desire memory-bound kernels, the primary performance goal of any underlying block-wide strategy is to be efficient enough that the local computational overhead (inter-thread communication, synchronization, and scan operators) can be absorbed by an oversubscribed memory subsystem. In other words, the computational overhead for block-wide scan must be less than the I/O overhead for the thread block. Improving block-wide scan efficiency beyond this point may reduce power consumption, but will not affect overall performance. For traditional prefix-sum kernels, the block's I/O overhead is dictated by 2b data movement: b items read and b items written. For allocation and compaction scenarios, this efficiency constraint can be twice as tight if no output items are actually produced. In CUB, we treat the selection of the local scan algorithm as a tuning parameter.

Fig. 2 illustrates several commonplace hybrid strategies for block-wide scan. Their different circuit sizes and depths make each suitable for different-sized data types and scan operators of different complexities. For illustrative purposes, we depict blocks of 16 threads comprised of 4-thread SIMD warps, with one item per thread. Increasing the thread granularity would entail wrapping the constructs with an outer layer of serial intra-thread reductions and scans. Dotted lines indicate barrier-synchronized communication through shared memory.

Fig. 2a presents a radix-4 Brent-Kung scan-then-propagate strategy that embeds 4-item Kogge-Stone warpscans and propagation fans. At the root of the hourglass, one of the warps is repurposed to warpscan the partial totals from each warp². Only two block-wide barriers are needed if the width of the SIMD warp is greater than the number of warps in the block. This block-wide scan technique was first demonstrated in CUDPP [23] and is one of several block-scan components available in CUB [21]. In this example, the depth is 5 levels and the size is 37 operators.

² This can also be considered a variation of the Han-Carlson network construction [15] in which the root of a Brent-Kung construction is simply replaced with a Kogge-Stone network.
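The Fig. 2a pattern for full-size (32-lane) warps can be sketched as follows (our own illustration, not CUB's BlockScan; it reuses warp_inclusive_scan from the earlier sketch and assumes blockDim.x is a multiple of 32 with at most 32 warps per block):

```cuda
// Block-wide inclusive scan in the style of Fig. 2a: per-warp scans, a root
// warpscan over the warp totals, then a propagation fan. Two barriers total.
__device__ int block_inclusive_scan(int val) {
    __shared__ int warp_aggregates[32];              // one running total per warp
    int lane      = threadIdx.x & 31;
    int warp      = threadIdx.x >> 5;
    int num_warps = blockDim.x >> 5;

    val = warp_inclusive_scan(val);                  // (1) scan within each warp
    if (lane == 31) warp_aggregates[warp] = val;     //     publish each warp's total
    __syncthreads();                                 // first block-wide barrier

    if (warp == 0) {                                 // (2) one warp scans the totals
        int agg = (lane < num_warps) ? warp_aggregates[lane] : 0;
        warp_aggregates[lane] = warp_inclusive_scan(agg);
    }
    __syncthreads();                                 // second block-wide barrier

    if (warp > 0) val += warp_aggregates[warp - 1];  // (3) propagation fan
    return val;
}
```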
Fig. 2b presents a radix-4 "raking" reduce-then-scan strategy in which the entire computation is delegated to a single warp³. With 4 items per thread, the delegate warp performs efficient serial upsweep and downsweep networks within the registers of each thread. The root of the hourglass comprises a Kogge-Stone warpscan. Only two block-wide barriers are needed. This type of raking block-wide scan technique was first demonstrated in MatrixScan [12] and is one of several block-scan components available in CUB [21]. Although the network is relatively deep (9 levels), this approach exhibits minimal inter-thread communication and is very efficient (only 29 operators).

³ "Raking" is a parallel decomposition in which each thread consumes a non-overlapping partition of consecutive items. It is visually reminiscent of the tines of a rake being dragged at an angle along the ground [6].

Fig. 2c presents a radix-4 Sklansky strategy in which 4-input Kogge-Stone warpscans are coupled with 4-output Sklansky fan propagation. It is also provided as a block-scan component within CUB [21]. Unlike the previous two strategies, the number of block-wide barriers increases with the log of the number of warps. In this example, the depth is 4 levels and the size is 36 operators.

[Fig. 2. GPU hybrid block-scan strategies for n = 16 and SIMD warp size 4. Rounded boxes illustrate warp assignments. (a) A radix-4 Brent-Kung scan-then-propagate strategy embedding Kogge-Stone warpscans and propagation fans; (b) a raking radix-4 reduce-then-scan strategy embedding serial reductions, serial scans, and a Kogge-Stone warpscan at the root; (c) a radix-4 Sklansky strategy embedding Kogge-Stone warpscans and propagation fans.]
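The serial intra-thread work exploited by raking strategies can be sketched as follows (our own illustration): each thread reduces its private items in registers during the upsweep, and later scans them, seeded by a prefix from the parallel network, during the downsweep.

```cuda
// Thread-local upsweep: serial reduction over ITEMS_PER_THREAD register values.
template <int ITEMS_PER_THREAD>
__device__ int thread_reduce(const int (&items)[ITEMS_PER_THREAD]) {
    int total = items[0];
    for (int i = 1; i < ITEMS_PER_THREAD; ++i) total += items[i];
    return total;                 // only this total enters the parallel network
}

// Thread-local downsweep: serial exclusive scan seeded with the thread's prefix.
template <int ITEMS_PER_THREAD>
__device__ void thread_exclusive_scan(int (&items)[ITEMS_PER_THREAD], int prefix) {
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int item = items[i];
        items[i] = prefix;        // exclusive: output precedes the item
        prefix += item;
    }
}
```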
3.2 Global scan strategies

Historically, GPU scan implementations have primarily embodied one of three radix-b strategies at the global level: scan-then-propagate, reduce-then-scan, or chained-scan.

Scan-then-propagate. The global scan implementations within CUDPP [10, 23] and Thrust [3] are examples of high-radix Brent-Kung data flow, recursively dispatching kernels of block-sized scan networks followed by kernels of block-sized fan propagation. Discounting the negligible I/O of inner levels, they incur ~4n global data movement, with the outermost kernels reading and writing ~n items each.

Reduce-then-scan. The global scan implementations within MatrixScan [12], B40C [1], MGPU [2], and by Ha and Han [16] are examples of reduce-then-scan dataflow, dispatching kernels of block-sized reduction networks followed by kernels of block-sized scan networks. MatrixScan does this recursively, whereas the other implementations employ a raking strategy in which the upsweep and downsweep thread blocks process multiple input tiles each, necessitating only a single root scan kernel.

As illustrated in Fig. 3, the input is partitioned evenly among G thread blocks, where G is the number of blocks that can be actively resident on the processor (and is uncorrelated to n). In the first kernel, each thread block reduces the tiles of its partition in an iterative, serial fashion. Then the small list of G block-aggregates is itself scanned. In the third kernel, each thread block iteratively computes a prefix scan across the tiles of its partition, seeded with the appropriate block-prefix computed by the scan of block-aggregates. By switching the behavior of the first upsweep thread block from reduction to scan, Ha and Han are able to elide the last block of the upsweep kernel and the first block of the downsweep kernel. The global data movement is ~3n (~2n items read, ~n items written).

Chained-scan. As an alternative, the chained-scan parallelization [27] is a single-pass approach in which thread blocks are each assigned a tile of input, and a serial dependence chain exists between thread blocks. Each thread block will wait on the inclusive prefix of its predecessor to become available. The global data movement is ~2n (n items read, n items written).

[Fig. 3. Three-kernel reduce-then-scan parallelization among G thread blocks (~3n global data movement)]

[Fig. 4. Single-pass chained-scan prefix scan among G thread blocks (~2n global data movement)]

[Fig. 5. Single-pass adaptive look-back prefix scan among G thread blocks (~2n global data movement)]
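Tallying only the last-level traffic on the n-item sequence itself (our own back-of-envelope summary of the strategies above; the G-item aggregate lists are negligible):

```latex
\begin{align*}
\text{scan-then-propagate:} \quad & 2n \;(\text{scan pass}) + 2n \;(\text{propagate pass}) \approx 4n \\
\text{reduce-then-scan:}    \quad & n  \;(\text{reduce pass, read-only}) + 2n \;(\text{scan pass}) \approx 3n \\
\text{chained-scan:}        \quad & n  \;(\text{read}) + n \;(\text{write}) = 2n
\end{align*}
```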
As illustrated in Fig. 4, the local block-wide scans are typically implemented using a parallel reduce-then-scan strategy. This allows the thread blocks to perform local upsweep and downsweep work in parallel, each core only having to wait at its phase-transition for the required prefix to be produced by its predecessor. The running prefix is then aggregated into the block-wide downsweep.

Chained-scan's performance is limited by the latency of signal propagation between thread blocks. For example, suppose an NVIDIA Tesla C2050 GPU processor with 144 GB/sec memory bandwidth, processor cores operating at 1.15 GHz, and a latency of 600 clock cycles to pass a message from one core to another through the memory hierarchy. This signaling latency limits the system to a scan throughput of 1.9M partitions/sec. If each core is assigned a partition of 256 inputs (32 bits each), the maximum-achievable scan throughput will be limited to 490M inputs/sec. This is well short of the theoretical memory bandwidth limit of 18B inputs/sec for simply reading and writing the data.

When limited by partition-signaling, one solution is to increase the partition size. (In the previous example, a partition size of 9000 inputs per thread block would be required to saturate memory bandwidth.) However, it may be impractical to increase the partition size arbitrarily. Many processor designs have limited storage capacity per core. For example, GPU processors provide virtualized cores whose performance is optimized when the working set fits completely within on-chip register file and shared memory resources. Our method, however, is insensitive to partition size and is therefore amenable to diverse architectural configurations.

4. Decoupled Look-back

4.1 Operation

Our method is a generalization of the chained-scan approach with dramatically reduced prefix propagation latencies. The idea is to decouple the singular dependence of each processor on its immediate predecessor at the expense of progressively redundant computation. Whereas the chained-scan approach has a fixed "look-back" of one partition, our method allows processors to inspect the status of predecessors that are increasingly further away. This is illustrated in Fig. 5.

As each partition is processed, its status is updated with the partition-wide aggregate. Each aggregate is simply the reduction of items within the partition, and can be computed independently from other partitions. Because the aggregates are readily available, each processor is generally free to peruse the aggregates recorded for preceding partitions, progressively accumulating them until the complete exclusive prefix is known (the running total across all prior partitions). The partition's status is then updated with the inclusive prefix, which is computed from the exclusive prefix and the partition-wide aggregate. The exclusive prefix is then used to begin the partition's downsweep scan phase. Furthermore, an opportunistic encounter with a predecessor's inclusive prefix permits early termination of the look-back phase.

More specifically, each parallel processor within our method operates as follows (a code sketch of the look-back phase appears after this list):

1. Initialize the partition descriptor. Each processor's partition is allocated a status descriptor having the following fields:

   aggregate. Used to record the partition-wide aggregate as computed by the upsweep phase of partition-scan.

   inclusive_prefix. Used to record the partition's inclusive prefix as computed by reducing aggregate with the accumulated look-back from preceding partitions.

   status_flag. The flag describes the partition's current status as one of the following:

   A – aggregate available. Indicates the aggregate field has been recorded for the associated partition.

   P – prefix available. Indicates the inclusive_prefix field has been recorded for the associated partition.

   X – invalid. Indicates no information about the partition is available to other processors. All descriptors are initialized with status X.

2. Synchronize. All processors synchronize to ensure a consistent view of initialized partition descriptors.

3. Compute and record the partition-wide aggregate. Each processor computes and records its partition-wide aggregate to the corresponding partition descriptor. It then executes a memory fence and updates the descriptor's status_flag to A. Furthermore, the processor owning the first partition copies aggregate to the inclusive_prefix field, updates status_flag to P, and skips to Step 6 below.

4. Determine the partition's exclusive prefix using decoupled look-back. Each processor initializes and maintains a running exclusive_prefix as it progressively inspects the descriptors of increasingly antecedent partitions, beginning with the immediately preceding partition. For each predecessor, the processor will conditionally perform the following based upon the predecessor's status_flag:

   X – Block (or continue polling) until the status_flag is not X.

   A – The predecessor's aggregate field is added to exclusive_prefix and the processor continues on to inspect the preceding tile.

   P – The predecessor's inclusive_prefix field is added to exclusive_prefix and the look-back phase is terminated.

5. Compute and record the partition-wide inclusive prefix. Each processor adds exclusive_prefix to aggregate and records the result to the descriptor's inclusive_prefix field. It then executes a memory fence and updates the descriptor's status_flag to P.

6. Perform a partition-wide scan seeded with the partition's exclusive prefix. Each processor completes the scan of its partition, incorporating exclusive_prefix into every output value.

The computation of each processor can proceed independently and in parallel with other processors throughout Steps 1, 3, 5, and 6. In Step 4 (decoupled look-back), each processor must wait on its predecessor(s) to finish Step 3 (record the partition-wide reduction).
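To make Steps 3-5 concrete, here is a minimal sketch of the descriptor and look-back loop (our own illustration, not CUB's implementation), assuming integer addition, a descriptor array in global memory, and a single polling thread acting on behalf of partition `part`:

```cuda
enum Status { STATUS_X, STATUS_A, STATUS_P };  // invalid, aggregate, prefix

struct PartitionDescriptor {
    volatile int status_flag;                  // raised last, after a fence
    volatile int aggregate;                    // recorded by Step 3
    volatile int inclusive_prefix;             // recorded by Step 5
};

// Step 3: record the aggregate, fence, then raise the flag so the value
// is globally visible before any other processor can observe the flag.
__device__ void record_aggregate(PartitionDescriptor& d, int agg) {
    d.aggregate = agg;
    __threadfence();                           // order value before flag
    d.status_flag = STATUS_A;
}

// Step 4: accumulate a running exclusive prefix by inspecting increasingly
// antecedent descriptors, terminating early upon encountering a P flag.
__device__ int decoupled_lookback(PartitionDescriptor* desc, int part) {
    int exclusive_prefix = 0;
    for (int pred = part - 1; pred >= 0; --pred) {
        int flag;
        do {
            flag = desc[pred].status_flag;     // X: block (poll) until valid
        } while (flag == STATUS_X);
        __threadfence();                       // order flag read before value read
        if (flag == STATUS_P) {
            exclusive_prefix += desc[pred].inclusive_prefix;  // covers 0..pred
            break;                             // look-back terminates
        }
        exclusive_prefix += desc[pred].aggregate;  // A: continue looking back
    }
    return exclusive_prefix;
}
```

For commutative operators such as integer addition, the near-to-far accumulation order used here is immaterial; a non-commutative operator would require combining the look-back contributions in partition order.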
4.2 Properties

Our method has the following properties:

Safety. The algorithm will run to completion if the system guarantees forward-progress for all processors. Forward progress ensures that no processor will wait indefinitely in Step 4: every predecessor is free to record its aggregate in Step 3.