watch: CUDA - 何琨 | CUDA Programming Model



Reduction 归约

(2024-03-01)

The parallel design depends on the operation to be performed. For example, for the task of summing N numbers, the operation executed by each thread in parallel is addition.

  1. The number of active threads is halved every step.

    Since the “plus” operation combines 2 numbers at a time, the data sequence is bisected at each step.

    source[8]: 0  1  2  3  4  5  6  7
    step 1:
        thread 0: source[0] + source[4] -> source[0]
        thread 1: source[1] + source[5] -> source[1]
        thread 2: source[2] + source[6] -> source[2]
        thread 3: source[3] + source[7] -> source[3]
    step 2:
        thread 0: source[0] + source[2] -> source[0]
        thread 1: source[1] + source[3] -> source[1]
    step 3:
        thread 0: source[0] + source[1]
    

    As shown above, the number of threads initially allocated is half of the total number of data items, and in each subsequent step the number of active threads is half of the number used in the previous step.

    Specifically, the 1st round uses 4 threads for 8 items, the 2nd round uses only 2 threads for the 4 results of the previous round, and the final round uses only 1 thread.

    8 items:   0 1 2 3 4 5 6 7
    4 threads: 0 1 2 3
    4 results
    2 threads: 0 1
    2 results
    1 thread:  0

    On a CPU this sum takes 7 sequential additions, whereas on the GPU it takes only 3 parallel steps.
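
    The following is a minimal sketch of this halving pattern (my own illustration, not the course code); it assumes the whole array fits in a single block, its length n is a power of two, and the kernel is launched with at least n/2 threads:

    // In-place halving reduction performed by a single block (sketch)
    __global__ void reduce_one_block(int* data, int n){
        // stride starts at n/2 and halves each step, as in the diagram above
        for (int stride = n / 2; stride >= 1; stride /= 2){
            if (threadIdx.x < stride){
                data[threadIdx.x] += data[threadIdx.x + stride];
            }
            __syncthreads(); // finish this step before halving the stride
        }
        // data[0] now holds the sum of all n elements
    }

    For n = 8 and a launch such as reduce_one_block<<<1, 4>>>(d_data, 8); (hypothetical), this reproduces exactly the 3 steps shown above.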

  2. When the total number of operations is larger than the number of allocated threads, the grid-stride loop trick can be used.

    • For example, suppose 32 operations need to be executed but only 8 threads are allocated. Each thread therefore has to be reused 4 times.

      Based on this, each thread first accumulates its 4 elements over the loops, and then the summation is performed within the block (8 threads).

    • Consider the behavior of a single thread: which values will it use?

      For a thread in the block, the values assigned to it over the 4 loop iterations (whose sum it accumulates) are:

      32 elements, 8 threads, accumulation over 4 loops:

      thread:   0    1    2   ...    7
      loop 1:   0    1    2   ...    7
      loop 2:   8    9   10   ...   15
      loop 3:  16   17   18   ...   23
      loop 4:  24   25   26   ...   31

    In this way, multiple passes over the data are compressed into a single block (8 threads).
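
    A minimal sketch of this per-thread accumulation (the names accumulate_sketch and d_partial are mine, for illustration; the kernel in the next item folds the same loop into the full reduction):

    // Per-thread accumulation with a grid-stride loop (sketch)
    __global__ void accumulate_sketch(int* d_in, int num_items, int* d_partial){
        int tid    = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        int stride = blockDim.x * gridDim.x;                // total threads in the grid
        int local_sum = 0;
        // With 8 threads in total, thread 0 visits elements 0, 8, 16, 24,
        // thread 1 visits 1, 9, 17, 25, and so on.
        for (int i = tid; i < num_items; i += stride){
            local_sum += d_in[i];
        }
        d_partial[tid] = local_sum; // one partial sum per thread
    }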

  3. Shared memory is very fast, so it can be used for data that is frequently accessed.

    Since each thread's accumulated sum over the 4 loops must later be summed across the BLOCK_SIZE threads of the block, these per-thread sums can be stored in shared memory for the frequent reads that follow.

    // Grid-stride loop: accumulate per-thread partial sums first
    // Allocate one shared slot per thread in the block
    __shared__ int acc_tmp[BLOCK_SIZE];
    
    int shared_tmp = 0; // per-thread accumulator in a register; necessary for a correct result
    
    // Each thread strides through the input by the total number of threads in the grid
    for(int ele_id=blockDim.x * blockIdx.x + threadIdx.x; 
        ele_id<num_items; ele_id+=blockDim.x*gridDim.x){
    
        shared_tmp += d_in[ele_id];
    }
    // __syncthreads(); // Syncing here (before the write) instead of after it can give a wrong result
    acc_tmp[threadIdx.x] = shared_tmp; // write the partial sum to shared memory
    __syncthreads();    // Necessary: all partial sums must be visible before the reduction
    

    Note:

    • If the accumulation is done directly in shared memory, e.g. acc_tmp[threadIdx.x] += d_in[ele_id];, the result could be wrong: the statement sits inside a loop, and because shared memory is accessible to all threads in a block, multiple threads may access the same memory at the same time.

    • The local variable (shared_tmp), however, resides in a register and is private to its thread; other threads cannot access it, so accumulating into shared_tmp is safe and necessary here.

    • The final __syncthreads(); cannot be placed right after the for loop; it must come after the shared-memory assignment (observed on a 1050 Ti), otherwise the result can be wrong. (A 1080 Ti was fine either way.)

    At this point, only the shared-memory array acc_tmp (one block's worth of partial sums) remains to be summed.

    The thread reduction is performed with a for loop that halves the number of active threads step by step:

    // Sum the BLOCK_SIZE numbers held in shared memory
    // The number of active threads halves each step
    // The initial number of threads is half the number of values
    for (int n_thrd_cur=BLOCK_SIZE/2; n_thrd_cur>=1; n_thrd_cur/=2){
      // Each active thread performs one addition
      // A temporary variable is necessary for memory safety:
      int sum_tmp = 0;
    
      // Only use the threads required for this step
      if (threadIdx.x < n_thrd_cur){
          sum_tmp = acc_tmp[threadIdx.x] 
                    + acc_tmp[threadIdx.x + n_thrd_cur];
      }
      __syncthreads(); // Necessary: the write below follows reads of the same memory
      // Must not reside inside the if or any other branch
    
      // Write the result back to shared memory
      if (threadIdx.x < n_thrd_cur){
          acc_tmp[threadIdx.x] = sum_tmp;
      }
      __syncthreads(); // Necessary (observed on both 1050 Ti and 1080 Ti)
    }
    

    Finally, the sum of a block (shared memory) is stored in acc_tmp[0].

  4. __syncthreads() is used when memory is read and then written/modified, to avoid data hazards: Hazard-wiki (Race condition-wiki, Memory safety-wiki).

    __syncthreads() cannot reside inside an if (or any other branch): when threads run in parallel and take different branches, some threads would never reach the barrier, leading to errors.
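
    A sketch based on the reduction loop above (same variable names): the barrier is hoisted out of the divergent branch so that every thread in the block reaches it.

    // WRONG: only threads with threadIdx.x < n_thrd_cur would reach the barrier
    // if (threadIdx.x < n_thrd_cur){
    //     sum_tmp = acc_tmp[threadIdx.x] + acc_tmp[threadIdx.x + n_thrd_cur];
    //     __syncthreads();
    // }

    // OK: the barrier sits outside the branch, so all threads reach it
    if (threadIdx.x < n_thrd_cur){
        sum_tmp = acc_tmp[threadIdx.x] + acc_tmp[threadIdx.x + n_thrd_cur];
    }
    __syncthreads();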

  5. atomicAdd guarantees that the read-modify-write of an address is not disrupted by other threads.

    When adding each block's sum acc_tmp[0] (shared memory) to d_out (global memory), one thread from every block accesses the same global memory location d_out, so atomicAdd is applied:

    // Accumulate all blocks in the grid
    // The sum of each block was stored in acc_tmp[0]
    // Each block uses 1 thread to add its sum to the total sum of all blocks
    if (blockIdx.x * blockDim.x < num_items){
        if (threadIdx.x == 0){
            atomicAdd(d_out, acc_tmp[0]);
        }
    }
    

Full code:
#include <stdio.h>

#define N 10000000  // number of data items (note: no '=' and no ';' in a #define)
#define BLOCK_SIZE 256  // blockDim.x
#define GRID_SIZE 32 // gridDim.x, the total threads allocated: 8192

__managed__ int source[N];  // allocate memory for input data
__managed__ int result_gpu[1] = {0};  // output the sum of N items

// Remember to declare types for the kernel arguments
__global__ void sum_gpu(int* d_in, int num_items, int* d_out){
    // grid-stride loop: accumulate a per-thread partial sum
    __shared__ int acc_tmp[BLOCK_SIZE];

    int shared_tmp = 0;
    for (int ele_id=blockIdx.x * blockDim.x + threadIdx.x; 
         ele_id < num_items; ele_id+=BLOCK_SIZE*GRID_SIZE){
        shared_tmp += d_in[ele_id];
    }
    acc_tmp[threadIdx.x] = shared_tmp;
    __syncthreads();

    // threads reduction
    int sum_tmp = 0;
    for (int thd_cur=BLOCK_SIZE/2; thd_cur>=1; thd_cur/=2){
        if (threadIdx.x < thd_cur){
            sum_tmp = acc_tmp[threadIdx.x] 
                      + acc_tmp[threadIdx.x + thd_cur];
        }
        __syncthreads();
        if (threadIdx.x < thd_cur){
            acc_tmp[threadIdx.x] = sum_tmp;
        }
        __syncthreads();
    }

    // accumulate all blocks
    if (blockIdx.x * blockDim.x < num_items){
        if (threadIdx.x == 0){
            // d_out[0] += acc_tmp[0];
            atomicAdd(d_out, acc_tmp[0]);
        }
    }
}

int main(){
    // Initialize source data
    // Can't use: for (int& i : source)
    for (int i=0; i<N; i++)
        source[i] = rand()%10;

    // record time
    cudaEvent_t start, stop_gpu, stop_cpu;
    cudaEventCreate(&start);
    cudaEventCreate(&stop_gpu);
    cudaEventCreate(&stop_cpu);

    cudaEventRecord(start);
    cudaEventSynchronize(start);

    for (int i=0; i<20; i++){   // avg time for 20 rounds
        result_gpu[0] = 0;  // clear value
        sum_gpu<<<GRID_SIZE, BLOCK_SIZE>>>(source, N, result_gpu);
        cudaDeviceSynchronize();    // wait for the GPU to finish
    }

    cudaEventRecord(stop_gpu);
    cudaEventSynchronize(stop_gpu);

    // cpu execution
    int result_cpu = 0;
    for (int i=0; i<N; i++)
        result_cpu += source[i];
    
    cudaEventRecord(stop_cpu);
    cudaEventSynchronize(stop_cpu);

    float time_gpu, time_cpu;
    cudaEventElapsedTime(&time_gpu, start, stop_gpu);
    cudaEventElapsedTime(&time_cpu, stop_gpu, stop_cpu);

    cudaEventDestroy(start);
    cudaEventDestroy(stop_gpu);
    cudaEventDestroy(stop_cpu);

    printf("Time on gpu: %.2f, Time on cpu: %.2f\n", time_gpu/20, time_cpu);
    printf("%s\n", (result_gpu[0] == result_cpu) ? "Equal" : "Error");
    printf("Sum on gpu: %d, Sum on cpu: %d\n", *result_gpu, result_cpu);
    return 0;
}

Output:

(base) yi@yi:~/Downloads/CUDA_Study$ nvcc Tut_KenHe/8_reduction.cu 
(base) yi@yi:~/Downloads/CUDA_Study$ ./a.out 
Time on gpu: 1.15, Time on cpu: 26.45
Equal
Sum on gpu: 45011704, Sum on cpu: 45011704