Reduction 归约
(2024-03-01)
- Reference: Parallel programming model - Wikipedia
- Source video: CUDA编程模型系列八(原子操作 / 规约 / 向量元素求和) - Ken He
- Code
The parallelism design depends on the operation to be performed. For example, for the task of summing N numbers, the operation executed on each thread in parallel is addition.
- The number of threads is reduced by half at every step. As the “plus” operation combines 2 numbers, the data sequence is bisected each time.

```
source[8]: 0 1 2 3 4 5 6 7
step 1:
    thread 0: source[0] + source[4] -> source[0]
    thread 1: source[1] + source[5] -> source[1]
    thread 2: source[2] + source[6] -> source[2]
    thread 3: source[3] + source[7] -> source[3]
step 2:
    thread 0: source[0] + source[2] -> source[0]
    thread 1: source[1] + source[3] -> source[1]
step 3:
    thread 0: source[0] + source[1]
```

As shown above, the number of threads initially allocated is half of the total number of data items, and in each following step the number of active threads is half of the number used in the previous step. Specifically, the 1st round uses 4 threads for 8 items, the 2nd round uses 2 threads for the 4 partial results of the previous round, and the final round uses only 1 thread. On a CPU this sum takes 7 sequential additions, whereas on the GPU it takes only 3 steps.
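To make the steps above concrete, here is a minimal sketch of the halving scheme as a single-block CUDA kernel; the kernel name `reduce_halving` and the assumptions that `n` is a power of two and `n/2 <= blockDim.x` are illustrative rather than taken from the video. (The full kernel later in this post uses a temporary variable and an extra barrier instead of updating in place.)

```cpp
// In-place halving reduction within a single block.
// Assumes n is a power of two and n/2 <= blockDim.x.
__global__ void reduce_halving(int *source, int n)
{
    for (int active = n / 2; active >= 1; active /= 2) {
        // In this step, thread t computes source[t] + source[t + active] -> source[t]
        if (threadIdx.x < active) {
            source[threadIdx.x] += source[threadIdx.x + active];
        }
        __syncthreads();  // finish this step before the next halving
    }
    // source[0] now holds the sum of the original n elements
}
```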
- When the total number of operations is larger than the number of allocated threads, the grid-stride loop trick can be used.
- For example, if 32 elements need to be summed but only 8 threads are allocated, each thread has to be reused 4 times. Based on this, each thread first accumulates over its 4 loop iterations, and then the summation is performed within the block (8 threads).
- Consider the behavior of a single thread: which values will it use? For a thread with global index `i`, where the total number of threads in the grid (the stride) is `S = blockDim.x * gridDim.x`, the sum of the values assigned to it over the 4 loops is `d_in[i] + d_in[i + S] + d_in[i + 2*S] + d_in[i + 3*S]`. In this way, multiple steps are compressed into a single block (8 threads); a concrete mapping is sketched below.
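A concrete illustration of that mapping, assuming the 32-element / 8-thread setup above (one block of 8 threads, so the stride is 8); the variable names follow the kernel code shown later in this post:

```cpp
// num_items = 32, one block of 8 threads  =>  stride = blockDim.x * gridDim.x = 8
// thread 0: d_in[0] + d_in[8]  + d_in[16] + d_in[24]
// thread 1: d_in[1] + d_in[9]  + d_in[17] + d_in[25]
//   ...
// thread 7: d_in[7] + d_in[15] + d_in[23] + d_in[31]
int shared_tmp = 0;
for (int ele_id = blockDim.x * blockIdx.x + threadIdx.x;  // first element of this thread
     ele_id < num_items;                                   // 4 iterations in this setup
     ele_id += blockDim.x * gridDim.x) {                   // jump by the grid stride
    shared_tmp += d_in[ele_id];
}
```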
- Shared memory is very fast, so it is a good place for data that is accessed frequently. As the per-thread sums accumulated over the 4 loops must be summed across the `BLOCK_SIZE` threads at the end, they are stored in shared memory for the frequent reads that follow.

```cpp
// grid loop: accumulate the loop iterations first

// allocate the same size as a block
__shared__ int acc_tmp[BLOCK_SIZE];

int shared_tmp = 0;  // Necessary to get the correct result
// Each thread accumulates the items spaced by the total number of threads
for (int ele_id = blockDim.x * blockIdx.x + threadIdx.x;
     ele_id < num_items;
     ele_id += blockDim.x * gridDim.x) {
    shared_tmp += d_in[ele_id];
}

// __syncthreads();                 // Sometimes leads to a wrong result here
acc_tmp[threadIdx.x] = shared_tmp;  // assign to shared memory
__syncthreads();                    // Necessary
```

Note:
- If the accumulation writes directly to shared memory, like `acc_tmp[threadIdx.x] += d_in[ele_id];`, the result could be wrong: the statement sits inside a loop, and because shared memory is accessible to all threads in a block, multiple threads may access the same memory at the same time.
- However, the local variable (`shared_tmp`) resides in a register and is private to its thread; other threads cannot access it. So accumulating into `shared_tmp` is safe and necessary.
- The last `__syncthreads();` cannot be placed right after the `for` loop; it must come after the shared-memory assignment (on a 1050 Ti). Otherwise, the result could be wrong. (A 1080 Ti is fine either way.)
So far, only the “block” of shared memory `acc_tmp` still needs to be summed. The thread reduction is performed through a `for` loop that halves the number of active threads step by step:

```cpp
// Sum the numbers in the shared memory of size BLOCK_SIZE
// Threads reduce by half each step
// The initial number of threads is half of the total data
for (n_thrd_cur = BLOCK_SIZE / 2; n_thrd_cur >= 1; n_thrd_cur /= 2) {
    // Let one thread do one operation: plus
    // A temporary variable is necessary for memory safety
    int sum_tmp = 0;

    // Only use the threads required
    if (threadIdx.x < n_thrd_cur) {
        sum_tmp = acc_tmp[threadIdx.x] + acc_tmp[threadIdx.x + n_thrd_cur];
    }
    __syncthreads();  // Necessary, as the same memory is written after being read;
                      // can't reside in `if` or other branch syntax

    // Write the result back to memory
    if (threadIdx.x < n_thrd_cur) {
        acc_tmp[threadIdx.x] = sum_tmp;
    }
    __syncthreads();  // Necessary for 1050Ti, 1080Ti
}
```

Finally, the sum of the block (in shared memory) is stored in `acc_tmp[0]`.
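As a worked trace of that loop, assume `BLOCK_SIZE = 8` and illustrative per-thread partial sums `0..7` in `acc_tmp` (so the expected total is 28); entries not written in a step keep their old values and are no longer read:

```
acc_tmp initially:        [ 0  1  2  3  4  5  6  7 ]
after n_thrd_cur = 4:     [ 4  6  8 10  .  .  .  . ]   // acc_tmp[t] = acc_tmp[t] + acc_tmp[t + 4]
after n_thrd_cur = 2:     [12 16  .  .  .  .  .  . ]   // acc_tmp[t] = acc_tmp[t] + acc_tmp[t + 2]
after n_thrd_cur = 1:     [28  .  .  .  .  .  .  . ]   // 28 = 0 + 1 + ... + 7
```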
- `__syncthreads()` is used when memory is read and then written/modified, to avoid a data Hazard-wiki (Race condition-wiki, Memory safety-wiki).
- `__syncthreads()` can't reside in an `if`, because that is a branch: when multiple threads run in parallel they may take different branches, so not all of them reach the barrier, which leads to errors.
- `atomicAdd` guarantees that the read/write to an address won't be disrupted by other threads. When adding each block's sum `acc_tmp[0]` (shared memory) into `d_out` (global memory), multiple threads (thread 0 of every block) access the same global memory `d_out`, so `atomicAdd` is applied:

```cpp
// Accumulate all blocks in the grid
// The sum of each block was stored in acc_tmp[0]
// Each block uses 1 thread to add its sum to the total sum of all blocks
if (blockIdx.x * blockDim.x < num_items) {
    if (threadIdx.x == 0) {
        atomicAdd(d_out, acc_tmp[0]);
    }
}
```
Full code:
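A minimal sketch that assembles the snippets above into a complete program; `BLOCK_SIZE`, `d_in`, `d_out`, and `num_items` follow the names used above, while the kernel name, the grid size, the input data (all ones), and the host-side check are assumptions.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

__global__ void reduce_sum(const int *d_in, int *d_out, int num_items)
{
    __shared__ int acc_tmp[BLOCK_SIZE];

    // Grid-stride loop: each thread accumulates its elements in a register
    int shared_tmp = 0;
    for (int ele_id = blockDim.x * blockIdx.x + threadIdx.x;
         ele_id < num_items;
         ele_id += blockDim.x * gridDim.x) {
        shared_tmp += d_in[ele_id];
    }
    acc_tmp[threadIdx.x] = shared_tmp;
    __syncthreads();

    // In-block reduction: halve the number of active threads each step
    for (int n_thrd_cur = BLOCK_SIZE / 2; n_thrd_cur >= 1; n_thrd_cur /= 2) {
        int sum_tmp = 0;
        if (threadIdx.x < n_thrd_cur) {
            sum_tmp = acc_tmp[threadIdx.x] + acc_tmp[threadIdx.x + n_thrd_cur];
        }
        __syncthreads();
        if (threadIdx.x < n_thrd_cur) {
            acc_tmp[threadIdx.x] = sum_tmp;
        }
        __syncthreads();
    }

    // One thread per block adds the block sum into the global result
    if (threadIdx.x == 0) {
        atomicAdd(d_out, acc_tmp[0]);
    }
}

int main()
{
    const int num_items = 1 << 20;
    int *h_in = new int[num_items];
    int cpu_sum = 0;
    for (int i = 0; i < num_items; ++i) {
        h_in[i] = 1;  // simple data so the expected sum is known
        cpu_sum += h_in[i];
    }

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, num_items * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));  // atomicAdd accumulates, so start from 0

    const int grid_size = 32;  // fewer threads than items, handled by the grid-stride loop
    reduce_sum<<<grid_size, BLOCK_SIZE>>>(d_in, d_out, num_items);

    int gpu_sum = 0;
    cudaMemcpy(&gpu_sum, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("CPU sum: %d, GPU sum: %d\n", cpu_sum, gpu_sum);

    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    return 0;
}
```

With the all-ones input assumed here, both printed sums should equal `num_items` (1048576) if the kernel is correct.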