Introducing CUDA UnBound (CUB) - Microway
InclusiveSum
(2024-01-27)
|
|
-
Example from 3DGS
1 2 3 4 5 6cub::DeviceScan::InclusiveSum( nullptr, // No memory is allocated geom.scan_size, // num of bytes needed to reserve geom.tiles_touched, // input sequence geom.tiles_touched, // becomes prefix sums P) // number of items in the sequence, 100K points- Given a block of
Pbytes of memory pointed to bygeom.tiles_touched, obtaingeom.scan_sizeindicating the number of bytes for temporay storage required to reserve for executing the function:
- Given a block of
SortPairs
(2024-02-03)
Sort key-value pairs according to keys
Example in diff-rast-gass
|
|
(2024-03-05)
Example in simple-KNN of 3DGS
Clustering points (indices) based on morton codes morton:
|
|
Reduce
min
(2024-02-28)
To execute this function, temp_storage_bytes is required to be reserved.
Therefore, this function needs to run twice:
-
Get the number of required temporary bytes, with specifying
d_temp_storageasnullptr; -
Execute the function again given the correct address
d_temp_storage
A simplest code: “cub_reduce.cu”
Full code:
|
|
Notes:
-
If the suffix is
.cppand when compiling withnvcc cub_reduce.cpp, there will be errors about Debug:1 2 3 4 5 6/usr/local/cuda-11.6/bin/../targets/x86_64-linux/include/cub/block/specializations/../../block/../util_debug.cuh:101:43: error: ‘Debug’ is not a member of ‘cub’ 101 | #define CubDebug(e) CUB_NS_QUALIFIER::Debug((cudaError_t) (e), __FILE__, __LINE__) | ^~~~~ /usr/local/cuda-11.6/bin/../targets/x86_64-linux/include/cub/util_allocator.cuh:695:17: note: in expansion of macro ‘CubDebug’ 695 | if (CubDebug(error = cudaSetDevice(entrypoint_device))) return error; | ^~~~~~~~ -
__host__ __forcelinefunction cannot be called, need__device__1 2 3 4 5 6/usr/local/cuda-11.6/bin/../targets/x86_64-linux/include/cub/device/dispatch/dispatch_reduce.cuh(145): error: calling a __host__ function("T1 CustomMin::operator ()<int> (const T1 &, const T1 &) const") from a __global__ function( "cub::DeviceReduceSingleTileKernel< ::cub::DeviceReducePolicy<int, int, int, ::CustomMin> ::Policy600, int *, int *, int, ::CustomMin, int> ") is not allowed /usr/local/cuda-11.6/bin/../targets/x86_64-linux/include/cub/device/dispatch/dispatch_reduce.cuh(145): error: identifier "CustomMin::operator ()<int> const" is undefined in device code -
The customize
min_op(orCustomMin()) can be changed tocub::Sum()to solve the sum of input data.
Ref:
-
CUDA编程模型系列八(原子操作 / 规约 / 向量元素求和) - 扫地的小何尚 (何琨) Found by DDG searching “cuda cub 例子”
sum
(2024-03-04)
-
Move the input data onto GPU first! Otherwise, it won’t be processed and the output variable won’t update and remains the initialized value, e.g., 0.
-
Move the output data back to host for printing! Otherwise, the gpu memory is not allowed to access and return segment fault.
Full code:
|
|
Ref:
-
Example Code Reference by this SO question: Sum reduction with CUB surfaced by DDG with searching: “cub::DeviceReduce::Reduce example tutorial”