- Source video: 【并行计算】CUDA在现代C++中如何运用?看这一个就够了!- 双笙子佯谬 - bilibili
- parallel101/course - GitHub
- Testing repo
- Textbook (PDF)
Enable CUDA in CMake
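The code block that belonged here was lost in extraction. A minimal `CMakeLists.txt` sketch that enables CUDA as a first-class language (the project name and architecture value are placeholder assumptions):

```cmake
cmake_minimum_required(VERSION 3.18)
project(hellocuda LANGUAGES CXX CUDA)  # enable the CUDA language

add_executable(main main.cu)
# Optional: pick the GPU architecture to compile for (placeholder value)
set_property(TARGET main PROPERTY CUDA_ARCHITECTURES 75)
```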
- CUDA syntax is compatible with C++, so nvcc can compile an existing C++ project after all `.cpp` files are renamed to `.cu`. nvcc compiles the CPU (host) and GPU (device) code jointly.
CPU-GPU Asynchronous Execution
For the sake of efficiency, after the CPU tells the GPU to run a kernel function (decorated with `__global__`), the CPU proceeds to the next line of code without waiting for the GPU to finish the computation.
Therefore, in the following code, the `printf` output never appears: the program returns right after the CPU pushes the task onto the GPU's execution queue, before the GPU has had time to execute the kernel and return results.
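The original snippet was lost in extraction; a minimal sketch of such a test program (saved as `test_async.cu`, matching the compile command below):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {
    printf("hello from GPU\n");  // queued, but never runs before main() returns
}

int main() {
    kernel<<<1, 1>>>();  // enqueue the kernel; the CPU does not wait
    return 0;            // program exits before the GPU executes the kernel
}
```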
- Compile: `nvcc test_async.cu`. Execute the application: `./a.out`.
Make the program wait for the GPU to complete all the tasks in its queue with `cudaDeviceSynchronize()`:
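The original snippet was lost in extraction; a minimal sketch (assuming the same `test_async.cu` kernel as above):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {
    printf("hello from GPU\n");
}

int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // block until all queued GPU work completes
    printf("hello from CPU\n");
    return 0;
}
```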
- `std::cout` and `std::endl` are host (CPU) functions, which can't be executed on the GPU:

  ```
  test_async.cu(7): error: calling a __host__ function ("std::basic_ostream<char, st...") from a __global__ function ("kernel") is not allowed
  ```

- `__host__` functions are compiled to be callable only from other host functions. NV Forums
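A minimal sketch (an assumption, reconstructing the kind of code that produces the error message above):

```cpp
#include <iostream>
#include <cuda_runtime.h>

__global__ void kernel() {
    // error: calling a __host__ function ("std::basic_ostream...")
    // from a __global__ function ("kernel") is not allowed
    std::cout << "hello" << std::endl;
}

int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Device code should use `printf` instead, which CUDA supports in kernels.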
Function types
Docs - Sec 7.1 Function Execution Space Specifiers
- `__global__` function: called from the host (or from other devices) and executed on the device.
- `__device__` function: called from other `__device__` (or `__global__`) functions and executed on the device.
- `__host__` function: called from `__host__` functions and executed on the CPU. A function not decorated with any execution space specifier is compiled as a `__host__` function.
- Calling a `__device__` function (from other device code) doesn't need `<<< >>>`, as it's called on the GPU internally:

  ```cpp
  #include <cuda_runtime.h>
  #include <cstdio>

  __device__ void say_hello() {
      printf("hello\n");
  }

  __global__ void kernel() {
      say_hello();
  }

  int main() {
      kernel<<<1, 1>>>();
      cudaDeviceSynchronize();
      return 0;
  }
  ```

- Since `__global__` functions are asynchronous and don't return immediately, their return type must be `void`. A `__device__` function, however, can have a return value, like a normal function.
- A function can be called from either GPU or CPU by decorating it with both specifiers, `__host__ __device__`:

  ```cpp
  #include <cstdio>
  #include <cuda_runtime.h>

  __host__ __device__ void say_hello() {
      printf("hello~\n");
  }

  __global__ void kernel() {
      say_hello();
  }

  int main() {
      kernel<<<1, 1>>>(); // GPU version
      cudaDeviceSynchronize();
      say_hello();        // CPU version
  }
  ```

  Will the computation in `say_hello` be executed by both the CPU and the GPU? (nvcc compiles such a function twice, once for the host and once for the device; each call site runs only its own version.)
- With nvcc, the `constexpr` keyword can effectively stand in for `__host__ __device__`, so that a `constexpr` function (e.g., a math function) can be called from either the host or the device:

  ```cpp
  #include <cstdio>
  #include <cuda_runtime.h>

  constexpr const char* cuthead(const char* p) {
      return p + 1;
  }

  __global__ void kernel() {
      printf(cuthead("Hello World!\n"));
  }

  int main() {
      kernel<<<1, 1>>>();
      cudaDeviceSynchronize();
      printf(cuthead("ABC\n"));
      return 0;
  }
  ```

  As with decorating by `__host__ __device__`, the `constexpr` function will be inlined automatically. Enable the nvcc flag `--expt-relaxed-constexpr` through a CMake generator expression, so that the flag applies only to `.cu` files and is not passed when gcc compiles `.cpp` files, which would otherwise cause an error:

  ```cmake
  add_executable(main main.cu foo.cpp)
  target_compile_options(main PUBLIC
      $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)
  ```

- Conversely, however, `__host__ __device__` can't be replaced with `constexpr`, because a `constexpr` function cannot call `printf` or GPU-specific functions like `__syncthreads()`.
Inline device functions
- If appropriate, the compiler will inline `__device__` functions automatically.
  - When the function body is too big, the compiler may decide not to inline it.
- `__noinline__` declares a function that won't be inlined into the places where it's called; `__forceinline__` is the opposite, forcing inlining.
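An illustrative sketch of the two specifiers (the function names here are made up for the example):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Ask nvcc not to inline this device function into its callers.
__device__ __noinline__ int add_noinline(int a, int b) { return a + b; }

// Ask nvcc to always inline this device function into its callers.
__device__ __forceinline__ int add_forceinline(int a, int b) { return a + b; }

__global__ void kernel() {
    printf("%d %d\n", add_noinline(1, 2), add_forceinline(3, 4));
}

int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```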