watch: Parallel101 - 彭于斌

Enable CUDA in CMake

1
2
3
4
5
6
7
8
9


cmake_minimum_required(VERSION 3.10) 

set(CMAKE_CXX_STANDARD 17) 
set(CMAKE_BUILD_TYPE Release)

# add CUDA
project(hellocuda LANGUAGES CXX CUDA)

add_executable(main main.cu)

CUDA syntax is compatible with C++, so nvcc can compile a C++ project by chaning all .cpp files renamed to .cu. The nvcc can compile CPU and GPU code jointly.

CPU-GPU Asyncronous

For the sake of efficiency, after CPU tells GPU to run the kernel function (decorated by __global__), CPU proceeds to the next line of code without waiting for the GPU to finish the computation.

Therefore, in the following code, the printf won’t be executed because programs returns directly after CPU pushes the task to GPU execution queue. However, the GPU didn’t have time to execute and return results.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


// Filename: test_async.cu
#include <cstdio>

__global__ void kernel() {
    printf("Hello World!\n");
}

int main(){
    kernel<<<1,1>>>();
    return 0;
}

Compile: nvcc test_async.cu. Execute application: ./a.out.

Set the program to wait for GPU completing all the tasks in its queue:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


#include <cstdio>
#include <cuda_runtime.h>
#include <iostream>

__global__ void kernel() {
    printf("Hello World!\n");
    // std::cout << "out" << std::endl;
}

int main(){
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}

std::cout and std::endl are “host (CPU) functions”, which can’t be executed on GPU.

1

test_async.cu(7): error: calling a __host__ function ("std::basic_ostream<char, st...") from a __global__ function("kernel") is not allowed

__host__ functions are compiled to callable only for other host functions. NV Forums

Function types

Docs - Sec 7.1 Function Execution Space Specifiers

__global__ function: called from the host or other devices, and executed on the device.
__device__ function: called from other __device__ (or __global__) functions and executed on device.
__host__ function: called from __host__ functions and executed on CPU.

A function without decorated by any execution space specifier is compiled as a __host__ function.

Calling a __device__ function (from other devices) doesn’t need <<< >>>, as it’s called on the GPU interally:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


#include <cuda_runtime.h>
#include <cstdio>
__device__ void say_hello() {
    printf("hello\n");
}

__global__ void kernel() {
    say_hello();
}

int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Since __gloabl__ functions are asyncronous and won’t return immediately, their return type must be void.

However, the __device__ can have return value, like a normal function.

A function can be called from either GPU or CPU with using both specifier: __host__ __device__

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


#include <cstdio>
#include <cuda_runtime.h>

__host__ __device__ void say_hello() {
    printf("hello~\n");
}
__global__ void kernel() {
    say_hello();
}
int main() {
    kernel<<<1, 1>>>();  // gpu version
    cudaDeviceSynchronize();
    say_hello(); // cpu version
}

Wil the computation in say_hello executed both by CPU and GPU?

The constexpr keyword can be replaced with __host__ __device__ by nvcc compiler to enable a constexpr function (e.g., math function) can be called from either a host or a device.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


#include <cstdio>
#include <cuda_runtime.h>

constexpr const char* cuthead(const char* p) {
    return p + 1;
}

__global__ void kernel() {
    printf(cuthead("Hello World!\n"));
}

int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    print(cuthead("ABC\n"));
    return 0;
}

By decorating with __host__ __device__, the constexpr function will be inlined automatically.

Enable the nvcc flag --expt-relaxed-constexpr with a “CMake的生成器表达式来实现只对 .cu 文件有效，而不会在 gcc 编译 .c 文件时生效，不然给到 gcc 就出错了” (?):

1
2


add_executable(main main.cu foo.cpp)
target_compile_options(main PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)

However, on the contrary, __host__ __device__ can’t be replaced with constexpr, because constexpr function cannot call printf and GPU-specific functions, like _syncthreads.

inline device function

Docs - Sec 7.1.5

If appropriate, the compiler will inline __device__ functions automatically.
- When the function body is too big, the compiler may won’t insert code.
__noinline__ declares a function that won’t be inserted into the place where it’s called.

And __forceinline__ is the opposite.