[Fig. 1: Commonplace scan constructions for n = 16. Dashed boxes illustrate recursive construction. Panels: (a) serial (chained scan), (b) Kogge-Stone, (c) Sklansky, (d) Brent-Kung, (e) reduce-then-scan.]
suggests that as circuit depth is increasingly constrained, DSO networks no longer exist and the size of the networks increases rapidly.

Fig. 1 presents commonplace scan networks relevant to contemporary prefix scan algorithms. Although the serial, or chained scan, construction in Fig. 1a has maximal n-1 depth and no concurrent computations, its minimal n-1 size makes it an attractive subcomponent of scan networks designed for oversubscribed processors. Increasing the computational granularity (i.e., items per thread) is a common technique for improving processor utilization by reducing inter-thread communication.

The Kogge-Stone construction [19] in Fig. 1b (and corresponding Hillis-Steele algorithm [17]) is a well-known, minimum-depth network that uses a recursive-doubling approach for aggregating partial reductions. Despite having inefficient O(n log2 n) work complexity, its shallow depth and simple shared memory address computations make it an attractive strategy for SIMD architectures (e.g., GPU warps) where inactive processor resources cannot be scavenged¹.

The Sklansky construction [24] in Fig. 1c employs a recursive, scan-then-fan approach that also achieves minimum depth log2 n at the expense of O(n log2 n) work complexity.
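As a concrete point of comparison, the serial and Kogge-Stone strategies can be sketched in a few lines of Python. This is an illustrative simulation of our own (function names like `serial_scan` are not from the paper); in a real implementation, each round of the recursive-doubling scan would execute its combines concurrently rather than in a loop.

```python
# Illustrative sketch: serial chained scan versus a Kogge-Stone /
# Hillis-Steele recursive-doubling scan, simulated with plain loops.

def serial_scan(x, op=lambda a, b: a + b):
    """Inclusive scan: n-1 operator applications, but also depth n-1."""
    out = list(x)
    for i in range(1, len(out)):
        out[i] = op(out[i - 1], out[i])
    return out

def kogge_stone_scan(x, op=lambda a, b: a + b):
    """Inclusive scan: ceil(log2 n) rounds, O(n log n) total operator
    applications. In hardware/SIMD, each round's combines run in parallel."""
    out = list(x)
    d = 1
    while d < len(out):
        prev = list(out)  # snapshot of the previous round's values
        for i in range(d, len(out)):
            out[i] = op(prev[i - d], prev[i])
        d *= 2
    return out
```

Both produce the same inclusive prefix (e.g., `serial_scan([1]*8)` and `kogge_stone_scan([1]*8)` both yield `[1, 2, ..., 8]`); the difference is that the serial form chains n-1 dependent applications, while recursive doubling trades redundant work for ⌈log2 n⌉ rounds.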
Compared to Kogge-Stone constructions, these networks exhibit high-radix fan-out when propagating partial prefixes computed by recursive subgraphs. This improved sharing leads to smaller circuit sizes and reduced memory bandwidth overheads.

¹ Kogge-Stone "warpscans" are typical of GPU implementations where (1) SIMD-synchronicity has historically enabled efficient barrier-free communication, and (2) the hardware provisions a "shuffle" crossbar for efficient intra-warp communication.

Whereas minimum-depth circuits are fast when the input size is less than or equal to the width of the underlying multiprocessor, minimum-size networks are important for larger problems. Many parallel programming models virtualize the underlying physical processors, causing overall runtime to scale with circuit size instead of depth. Therefore, work-efficiency is often a practical design objective for general-purpose prefix scan algorithms.

The Brent-Kung construction [8] in Fig. 1d (and corresponding Blelloch algorithm [4, 5]) is a work-efficient strategy having 2 log2 n depth and O(n) size. Visually, the data flow resembles an "hourglass" shape comprising (1) an upsweep accumulation tree having progressively less parallelism, and (2) a downsweep propagation tree exhibiting progressively more parallelism. Generalizing the binary operators in the upsweep with radix-b scans and those in the downsweep with radix-b fans, the Brent-Kung strategy exhibits a more pronounced scan-then-propagate behavior (as illustrated in Fig. 2a).

For programming models that virtualize an unlimited number of processors, a concern with scan-then-propagate data flow is that ~n live values are spilled and filled through last-level memory between the upsweep and downsweep phases when the input exceeds on-chip memory. To eliminate the writes, we can simply rematerialize the intermediates during the downsweep phase at the expense of O(n) redundant calculations, as shown in Fig. 1e [9, 12, 22].
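To make this rematerializing data flow concrete, here is a small Python sketch of our own (names such as `reduce_then_scan` and the `tile` parameter are illustrative, not from the paper): the upsweep reduces each tile to one aggregate, the few aggregates are scanned serially to obtain each tile's exclusive prefix, and the downsweep re-scans each tile from that prefix, recomputing intermediates instead of spilling them.

```python
# Illustrative sketch of the Fig. 1e strategy: only one aggregate per tile
# needs to live between the upsweep and downsweep phases; the ~n tile
# intermediates are rematerialized (O(n) redundant work) in the downsweep.

def reduce_then_scan(x, tile=4, op=lambda a, b: a + b):
    tiles = [x[i:i + tile] for i in range(0, len(x), tile)]

    # Upsweep: reduce each tile to a single aggregate (independent across
    # tiles, so parallel in a real implementation).
    def reduce_tile(t):
        acc = t[0]
        for v in t[1:]:
            acc = op(acc, v)
        return acc

    aggregates = [reduce_tile(t) for t in tiles]

    # Serially scan the (few) tile aggregates to get each tile's exclusive
    # prefix; None marks "no predecessor" for the first tile.
    prefixes = [None]
    for a in aggregates[:-1]:
        prefixes.append(a if prefixes[-1] is None else op(prefixes[-1], a))

    # Downsweep: re-scan each tile seeded with its exclusive prefix,
    # rematerializing the intermediates instead of reading spilled values.
    out = []
    for seed, t in zip(prefixes, tiles):
        acc = seed
        for v in t:
            acc = v if acc is None else op(acc, v)
            out.append(acc)
    return out
```

For example, `reduce_then_scan([3, 1, 4, 1, 5, 9, 2, 6], tile=4)` matches a serial inclusive scan of the same input, yet the only values that must survive between the two phases are the per-tile aggregates.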
This has the effect of converting downsweep behavior from propagation to scan. We refer to this adaptation as the reduce-then-scan strategy.

In general, an important property of recursive network design is the ability to mix-and-match different strategies at different levels. Further variation is also possible through operator generalization: whereas these binary operators compute radix-2 scans and fans, network height can be reduced using radix-b subcomponents as building blocks [18]. This flexibility allows for