

Parallel Prefix Sum (Scan) with CUDA
Mark Harris
mharris@nvidia.com
April 2007


Document Change History

Version   Date                Responsible   Reason for Change
          February 14, 2007   Mark Harris   Initial release


Abstract

Parallel prefix sum, also known as parallel scan, is a useful building block for many parallel algorithms, including sorting and building data structures. In this document we introduce scan and describe step by step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.


Table of Contents

Abstract
Table of Contents
Introduction
    Inclusive and Exclusive Scan
    Sequential Scan
A Naïve Parallel Scan
A Work-Efficient Parallel Scan
Avoiding Bank Conflicts
Arrays of Arbitrary Size
Performance
Conclusion
Bibliography


Introduction

A simple and common parallel algorithm building block is the all-prefix-sums operation. In this paper we will define and illustrate the operation, and discuss in detail its efficient implementation on NVIDIA CUDA. As mentioned by Blelloch [1], all-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. The all-prefix-sums operation is defined as follows in [1]:

Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements [a0, a1, ..., an-1], and returns the array [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-1)].

Example: If ⊕ is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3] would return [3 4 11 11 14 16 22 25].

There are many uses for all-prefix-sums, including, but not limited to, sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1]. In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations, as shown in Figure 1.

    Sequential:
        out[0] = 0;
        for j from 1 to n do
            out[j] = out[j-1] + f(in[j-1]);

    Parallel:
        forall j in parallel do
            temp[j] = f(in[j]);
        all_prefix_sums(out, temp);

Figure 1: A sequential computation and its parallel equivalent.

Inclusive and Exclusive Scan

All-prefix-sums on an array of data is commonly known as scan. We will use this simpler terminology (which comes from the APL programming language [1]) for the remainder of this paper. As shown in the last section, a scan of an array generates a new array where each element j is the sum of all elements up to and including j. This is an inclusive scan.
It is often useful for each element j in the results of a scan to contain the sum of all previous elements, but not element j itself. This operation is commonly known as an exclusive scan (or prescan) [1].

Definition: The exclusive scan operation takes a binary associative operator ⊕ with identity I and an array of n elements [a0, a1, ..., an-1], and returns the array [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)].
