CUDA Toolkit 12.4.0 tuple incompatibility #3690

Open
4 tasks done
runer112 opened this issue Mar 8, 2024 · 11 comments · May be fixed by #3744

@runer112
runer112 commented Mar 8, 2024

System information (version)
  • OpenCV => 4.9.0
  • Operating System / Platform => Windows 64 Bit
  • Compiler => Visual Studio 2022
Detailed description

opencv with CUDA support cannot be built using CUDA Toolkit 12.4.0.

While CUDA Toolkit 12.3.2 uses thrust version 2.2.0 (https://docs.nvidia.com/cuda/archive/12.3.2/cuda-toolkit-release-notes/index.html), CUDA Toolkit 12.4.0 updates to thrust version 2.3.1 (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). In thrust version 2.3.0, the tuple implementation was replaced with a standard tuple implementation (NVIDIA/cccl#262). Notably, this changes the definition from a 10-parameter template to a variadic template. So instead of a tuple of n items being padded out with 10 - n null types to always have 10 template parameters, it now has only n template parameters. As a result, the function templates in cudev that spell out 10 template parameters per tuple are no longer viable for any tuple whose size is not 10.
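
The deduction failure can be reproduced on the host without CUDA. The following is only a minimal sketch: it uses `std::tuple` as a stand-in for `cuda::std::tuple`, and `fixedArity` and `CanCallFixedArity` are hypothetical names made up for this illustration, not cudev or thrust APIs.

```cpp
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a cudev-style overload that names ten tuple parameters.
template <typename P0, typename P1, typename P2, typename P3, typename P4,
          typename P5, typename P6, typename P7, typename P8, typename P9>
inline int fixedArity(const std::tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9>&)
{
    return 10;
}

// Detects (via SFINAE) whether fixedArity(t) is callable for a tuple type T.
template <typename T, typename = void>
struct CanCallFixedArity : std::false_type {};

template <typename T>
struct CanCallFixedArity<T, decltype(void(fixedArity(std::declval<const T&>())))>
    : std::true_type {};

// A variadic two-element tuple has exactly two template arguments,
// so the ten-parameter pattern cannot be deduced from it.
static_assert(!CanCallFixedArity<std::tuple<int, int>>::value,
              "ten-parameter overload must not match a 2-tuple");

// Only a tuple with exactly ten elements still matches.
static_assert(CanCallFixedArity<std::tuple<int, int, int, int, int,
                                           int, int, int, int, int>>::value,
              "ten-parameter overload matches a 10-tuple");
```

With the old padded definition, a two-element tuple was really a ten-argument specialization, which is why the same overload used to be selected.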

An example of one such function template that's no longer viable, cv::cudev::blockReduce:

template <int N,
          typename P0, typename P1, typename P2, typename P3, typename P4, typename P5, typename P6, typename P7, typename P8, typename P9,
          typename R0, typename R1, typename R2, typename R3, typename R4, typename R5, typename R6, typename R7, typename R8, typename R9,
          class Op0, class Op1, class Op2, class Op3, class Op4, class Op5, class Op6, class Op7, class Op8, class Op9>
__device__ __forceinline__ void blockReduce(const tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9>& smem,
                                            const tuple<R0, R1, R2, R3, R4, R5, R6, R7, R8, R9>& val,
                                            uint tid,
                                            const tuple<Op0, Op1, Op2, Op3, Op4, Op5, Op6, Op7, Op8, Op9>& op)
{
    block_reduce_detail::Dispatcher<N>::reductor::template reduce<
        const tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9>&,
        const tuple<R0, R1, R2, R3, R4, R5, R6, R7, R8, R9>&,
        const tuple<Op0, Op1, Op2, Op3, Op4, Op5, Op6, Op7, Op8, Op9>&>(smem, val, tid, op);
}

An example of an error I encounter:

[build] Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev\grid\detail/reduce.hpp(379): error : no instance of overloaded function "cv::cudev::blockReduce" matches the argument list [Z:\dev\1\opencv\out\build\user\modules\world\opencv_world.vcxproj]
[build]               argument types are: (cuda::std::__4::tuple<volatile int *, volatile int *>, cuda::std::__4::tuple<int &, int &>, int, cuda::std::__4::tuple<cv::cudev::minimum<int>, cv::cudev::maximum<int>>)
[build]                 blockReduce<BLOCK_SIZE>(smem_tuple(sminval, smaxval), tie(mymin, mymax), tid, make_tuple(minOp, maxOp));
[build]                 ^
[build]   Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/block/reduce.hpp(72): note #3327-D: candidate function template "cv::cudev::blockReduce<N,P0,P1,P2,P3,P4,P5,P6,P7,P8,P9,R0,R1,R2,R3,R4,R5,R6,R7,R8,R9,Op0,Op1,Op2,Op3,Op4,Op5,Op6,Op7,Op8,Op9>(const thrust::THRUST_200301_500_520_600_610_700_750_800_860_890_900_NS::tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9> &, const thrust::THRUST_200301_500_520_600_610_700_750_800_860_890_900_NS::tuple<R0, R1, R2, R3, R4, R5, R6, R7, R8, R9> &, uint, const thrust::THRUST_200301_500_520_600_610_700_750_800_860_890_900_NS::tuple<Op0, Op1, Op2, Op3, Op4, Op5, Op6, Op7, Op8, Op9> &)" failed deduction
[build]     __declspec(__device__) __forceinline void blockReduce(const tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9>& smem,
[build]                                               ^
[build]   Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/block/reduce.hpp(63): note #3327-D: candidate function template "cv::cudev::blockReduce<N,T,Op>(volatile T *, T &, uint, const Op &)" failed deduction
[build]     __declspec(__device__) __forceinline void blockReduce(volatile T* smem, T& val, uint tid, const Op& op)
[build]                                               ^
[build]             detected during:
[build]               instantiation of "void cv::cudev::grid_reduce_detail::MinMaxReductor<cv::cudev::grid_reduce_detail::both, src_type, work_type>::reduceGrid<BLOCK_SIZE>(work_type *, int) [with src_type=uchar, work_type=int, BLOCK_SIZE=256]" at line 412
[build]               instantiation of "void cv::cudev::grid_reduce_detail::reduce<Reductor,BLOCK_SIZE,PATCH_X,PATCH_Y,SrcPtr,ResType,MaskPtr>(SrcPtr, ResType *, MaskPtr, int, int) [with Reductor=cv::cudev::grid_reduce_detail::MinMaxReductor<cv::cudev::grid_reduce_detail::both, uchar, int>, BLOCK_SIZE=256, PATCH_X=4, PATCH_Y=4, SrcPtr=cv::cudev::GlobPtr<uchar>, ResType=int, MaskPtr=cv::cudev::WithOutMask]" at line 421
[build]               instantiation of "void cv::cudev::grid_reduce_detail::reduce<Reductor,Policy,SrcPtr,ResType,MaskPtr>(const SrcPtr &, ResType *, const MaskPtr &, int, int, cudaStream_t) [with Reductor=cv::cudev::grid_reduce_detail::MinMaxReductor<cv::cudev::grid_reduce_detail::both, uchar, int>, Policy=cv::cudev::DefaultGlobReducePolicy, SrcPtr=cv::cudev::GlobPtr<uchar>, ResType=int, MaskPtr=cv::cudev::WithOutMask]" at line 460
[build]               instantiation of "void cv::cudev::grid_reduce_detail::minMaxVal<Policy,SrcPtr,ResType,MaskPtr>(const SrcPtr &, ResType *, const MaskPtr &, int, int, cudaStream_t) [with Policy=cv::cudev::DefaultGlobReducePolicy, SrcPtr=cv::cudev::GlobPtr<uchar>, ResType=int, MaskPtr=cv::cudev::WithOutMask]" at line 206 of Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/grid/reduce.hpp
[build]               instantiation of "void cv::cudev::gridFindMinMaxVal_<Policy,SrcPtr,ResType>(const SrcPtr &, cv::cudev::GpuMat_<ResType> &, cv::cuda::Stream &) [with Policy=cv::cudev::DefaultGlobReducePolicy, SrcPtr=cv::cudev::GpuMat_<uchar>, ResType=int]" at line 349 of Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/grid/reduce.hpp
[build]               instantiation of "void cv::cudev::gridFindMinMaxVal(const SrcPtr &, cv::cudev::GpuMat_<ResType> &, cv::cuda::Stream &) [with SrcPtr=cv::cudev::GpuMat_<uchar>, ResType=int]" at line 68 of Z:\dev\1\opencv_contrib\modules\cudaarithm\src\cuda\minmax.cu
[build]               instantiation of "void <unnamed>::minMaxImpl<T,R>(const cv::cuda::GpuMat &, const cv::cuda::GpuMat &, cv::cuda::GpuMat &, cv::cuda::Stream &) [with T=uchar, R=int]" at line 92 of Z:\dev\1\opencv_contrib\modules\cudaarithm\src\cuda\minmax.cu

The first nonviable candidate shown in the error message is the function template quoted above, which was viable and selected with previous CUDA Toolkit versions.

I think that all templates specifying 10 template parameters per tuple can be updated to work with the new tuple definition by replacing each set of 10 template parameters with a parameter pack. I think this should still be compatible with the old tuple definition, as well. For example, I think this would be a viable implementation of cv::cudev::blockReduce:

template <int N, typename... P, typename... R, class... Op>
__device__ __forceinline__ void blockReduce(const tuple<P...>& smem,
                                            const tuple<R...>& val,
                                            uint tid,
                                            const tuple<Op...>& op)
{
    block_reduce_detail::Dispatcher<N>::reductor::template reduce<
        const tuple<P...>&,
        const tuple<R...>&,
        const tuple<Op...>&>(smem, val, tid, op);
}
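
As a rough host-side check of the backward-compatibility guess, the sketch below models the old padded tuple with a hypothetical `PaddedTuple` class template and a hypothetical `NullType` padding type (stand-ins for the pre-2.3.0 `thrust::tuple` and `thrust::null_type`, not the real definitions). It suggests that a parameter pack deduces all ten slots of a padded tuple, padding included, so the type forwarded to the reductor would be unchanged.

```cpp
#include <cstddef>

// Hypothetical stand-in for thrust::null_type.
struct NullType {};

// Hypothetical model of the old tuple: always ten parameters, padded with NullType.
template <typename T0 = NullType, typename T1 = NullType, typename T2 = NullType,
          typename T3 = NullType, typename T4 = NullType, typename T5 = NullType,
          typename T6 = NullType, typename T7 = NullType, typename T8 = NullType,
          typename T9 = NullType>
struct PaddedTuple {};

// Pack-based signature in the style of the proposed blockReduce.
// Returns how many template arguments the pack deduced.
template <typename... P>
constexpr std::size_t deducedArity(const PaddedTuple<P...>&)
{
    return sizeof...(P);
}

// The pack picks up every slot of the padded tuple, including the padding.
static_assert(deducedArity(PaddedTuple<int, float>()) == 10,
              "pack deduces all ten slots of a padded tuple");
```

Note that `sizeof...(P)` is therefore the padded length (10) for old-style tuples and the real element count for variadic ones, so any code that relies on it would need to treat the two cases differently.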
Steps to reproduce

Attempt to build cudev using CUDA Toolkit 12.4.0. I suspect that this error will be observed with any combination of OpenCV version, OS, platform, and compiler (that are modern enough to not encounter some other error first).

Issue submission checklist
  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues,
    forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files: videos, images, onnx, etc
@sjuxax
sjuxax commented Mar 12, 2024

Just to confirm your suspicion that this affects cross-platform builds, getting the same errors on Linux with GCC 13:

opencv_contrib-4.9.0/modules/cudev/include/opencv2/cudev/grid/detail/reduce_to_column.hpp(73): error: no instance of overloaded function "cv::cudev::blockReduce" matches the argument list                                                 
            argument types are: (cuda::std::__4::tuple<volatile float *, volatile float *>, cuda::std::__4::tuple<float &, float &>, const unsigned int, cuda::std::__4::tuple<cv::cudev::Sum<float>, cv::cudev::Sum<float>>)                                           
              blockReduce<BLOCK_SIZE>(smem_tuple(smem[0], smem[1]), tie(myVal.x, myVal.y), threadIdx.x, make_tuple(op, op));

^ one such error

@juls007
juls007 commented Mar 24, 2024

I have the same issue when building the latest OpenCV 4 from source with CUDA 12.4, cuDNN 9 and GCC 13 on Fedora 39.

General configuration for OpenCV 4.9.0-dev =====================================
--   Version control:               4.9.0-293-g912cf2a028
-- 
--   Extra modules:
--     Location (extra):            /home/coder/projects/opencv-src/opencv_contrib/modules
--     Version control (extra):     4.9.0-51-gab821068
-- 
--   Platform:
--     Timestamp:                   2024-03-29T14:50:46Z
--     Host:                        Linux 6.7.10-200.fc39.x86_64 x86_64
--     CMake:                       3.27.7
--     CMake generator:             Unix Makefiles
--     CMake build tool:            /usr/bin/gmake
--     Configuration:               RELEASE
-- 
--   CPU/HW features:
--     Baseline:                    SSE SSE2 SSE3
--       requested:                 SSE3
--     Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
--       requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
--       SSE4_1 (18 files):         + SSSE3 SSE4_1
--       SSE4_2 (2 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
--       FP16 (1 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
--       AVX (9 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
--       AVX2 (38 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
--       AVX512_SKX (8 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX
-- 
--   C/C++:
--     Built as dynamic libs?:      YES
--     C++ standard:                11
--     C++ Compiler:                /usr/lib64/ccache/c++  (ver 13.2.1)
--     C++ flags (Release):         -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
--     C++ flags (Debug):           -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
--     C Compiler:                  /usr/lib64/ccache/cc
--     C flags (Release):           -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
--     C flags (Debug):             -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
--     Linker flags (Release):      -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
--     Linker flags (Debug):        -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
--     ccache:                      YES
--     Precompiled headers:         NO
--     Extra dependencies:          m pthread cudart_static dl rt nppc nppial nppicc nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda/lib64 -L/lib64
--     3rdparty dependencies:
-- 
--   OpenCV modules:
--     To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hdf hfs highgui img_hash imgcodecs imgproc intensity_transform java line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot python3 quality rapid reg rgbd saliency sfm shape signal stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab viz wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
--     Disabled:                    cudacodec world
--     Disabled by dependency:      -
--     Unavailable:                 cannops cvv julia matlab ovis python2
--     Applications:                tests perf_tests apps
--     Documentation:               NO
--     Non-free algorithms:         YES
-- 
--   GUI:                           GTK3
--     GTK+:                        YES (ver 3.24.41)
--       GThread :                  YES (ver 2.78.3)
--       GtkGlExt:                  NO
--     VTK support:                 YES (ver 9.2.6)
-- 
--   Media I/O: 
--     ZLib:                        /lib64/libz.so (ver 1.2.13)
--     JPEG:                        /lib64/libjpeg.so (ver 62)
--     WEBP:                        /lib64/libwebp.so (ver encoder: 0x020f)
--     PNG:                         /lib64/libpng.so (ver 1.6.37)
--     TIFF:                        /lib64/libtiff.so (ver 42 / 4.4.0)
--     JPEG 2000:                   OpenJPEG (ver 2.5.0)
--     OpenEXR:                     OpenEXR::OpenEXR (ver 3.1.10)
--     HDR:                         YES
--     SUNRASTER:                   YES
--     PXM:                         YES
--     PFM:                         YES
-- 
--   Video I/O:
--     DC1394:                      YES (2.2.7)
--     FFMPEG:                      YES
--       avcodec:                   YES (60.31.102)
--       avformat:                  YES (60.16.100)
--       avutil:                    YES (58.29.100)
--       swscale:                   YES (7.5.100)
--       avresample:                NO
--     GStreamer:                   YES (1.22.9)
--     v4l/v4l2:                    YES (linux/videodev2.h)
-- 
--   Parallel framework:            TBB (ver 2020.3 interface 11103)
-- 
--   Trace:                         YES (with Intel ITT)
-- 
--   Other third-party libraries:
--     Intel IPP:                   2021.10.1 [2021.10.1]
--            at:                   /home/coder/projects/opencv-src/opencv/build/3rdparty/ippicv/ippicv_lnx/icv
--     Intel IPP IW:                sources (2021.10.1)
--               at:                /home/coder/projects/opencv-src/opencv/build/3rdparty/ippicv/ippicv_lnx/iw
--     VA:                          YES
--     Lapack:                      YES (/lib64/libopenblas.so)
--     Eigen:                       YES (ver 3.4.0)
--     Custom HAL:                  NO
--     Protobuf:                    build (3.19.1)
--     Flatbuffers:                 builtin/3rdparty (23.5.9)
-- 
--   NVIDIA CUDA:                   YES (ver 12.4, CUFFT CUBLAS FAST_MATH)
--     NVIDIA GPU arch:             75
--     NVIDIA PTX archs:            75
-- 
--   cuDNN:                         YES (ver 9.0.0)
-- 
--   OpenCL:                        YES (INTELVA)
--     Include path:                /home/coder/projects/opencv-src/opencv/3rdparty/include/opencl/1.2
--     Link libraries:              Dynamic load
-- 
--   Python 3:
--     Interpreter:                 /usr/bin/python3 (ver 3.12.2)
--     Libraries:                   /lib64/libpython3.12.so (ver 3.12.2)
--     Limited API:                 NO
--     numpy:                       /usr/lib64/python3.12/site-packages/numpy/core/include (ver 1.24.4)
--     install path:                lib/python3.12/site-packages/cv2/python-3.12
-- 
--   Python (for build):            /usr/bin/python3
-- 
--   Java:                          
--     ant:                         NO
--     Java:                        YES (ver 17.0.10)
--     JNI:                         /home/coder/.sdkman/candidates/java/current/include /home/coder/.sdkman/candidates/java/current/include/linux /home/coder/.sdkman/candidates/java/current/include
--     Java wrappers:               YES (JAVA)
--     Java tests:                  NO
-- 
--   Install to:                    /usr/local
-- -----------------------------------------------------------------

@moodzunl
I'm having the same issue when building the latest OpenCV 4 from source on Windows 11.

@HellmannM
HellmannM commented Mar 26, 2024

I agree, this should be fixable the way you describe it. However, tuple_size would need to be replaced as well. That is probably straightforward, but it is used inside templates where the tuple's parameter types are not directly visible. Example (see the last line):

template <class SrcPtr, class DstPtrTuple, class OpTuple, class MaskPtr>
__global__ void transform_tuple(const SrcPtr src, DstPtrTuple dst, const OpTuple op, const MaskPtr mask, const int rows, const int cols)
{   
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= cols || y >= rows || !mask(y, x)) 
        return;

    typename PtrTraits<SrcPtr>::value_type srcVal = src(y, x); 

    Unroll<tuple_size<DstPtrTuple>::value>::transform(srcVal, dst, op, y, x); 
}

Here one instance is compiled with
DstPtrTuple=cv::cudev::ZipPtr<cuda::std::__4::tuple<cv::cudev::GlobPtr<float>, cv::cudev::GlobPtr<float>>>
How can the correct template parameters be restored? Not expanding to the correct nested template would give the wrong result. Best I could come up with was:

template <typename T>
struct tuple_size {};

template <typename... P>
struct tuple_size< tuple<P...> >
{
    static const int value = sizeof...(P);
};

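// Matches any single-parameter wrapper around a tuple, e.g. ZipPtr<tuple<...>>.
// Uses "class T": "typename" in a template template parameter requires C++17.
template <template <typename> class T, typename... P>
struct tuple_size< T<tuple<P...>> >
{
    static const int value = sizeof...(P);
};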

This seems to work for the case mentioned above. However, I am not sure whether it gives the correct result in all cases. Can someone give feedback, or suggest a more elegant way to solve this?

@stefanboca
stefanboca commented Mar 28, 2024

Alternatively, Thrust's tuple_size can be specialized for ZipPtr<tuple<...>>.
Something like:

// placed at the end of modules/cudev/include/opencv2/cudev/ptr2d/zip.hpp, in the global namespace

_LIBCUDACXX_BEGIN_NAMESPACE_STD

template<class Ptr0, class Ptr1>
struct tuple_size<cv::cudev::ZipPtr<tuple<Ptr0, Ptr1>>> : tuple_size<tuple<Ptr0, Ptr1>> {};

template<class Ptr0, class Ptr1, class Ptr2>
struct tuple_size<cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2>> {};

template<class Ptr0, class Ptr1, class Ptr2, class Ptr3>
struct tuple_size<cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2, Ptr3>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2, Ptr3>> {};


template<class Ptr0, class Ptr1>
struct tuple_size<cv::cudev::ZipPtrSz<tuple<Ptr0, Ptr1>>> : tuple_size<tuple<Ptr0, Ptr1>> {};

template<class Ptr0, class Ptr1, class Ptr2>
struct tuple_size<cv::cudev::ZipPtrSz<tuple<Ptr0, Ptr1, Ptr2>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2>> {};

template<class Ptr0, class Ptr1, class Ptr2, class Ptr3>
struct tuple_size<cv::cudev::ZipPtrSz<tuple<Ptr0, Ptr1, Ptr2, Ptr3>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2, Ptr3>> {};


template<size_t N, class Ptr0, class Ptr1>
struct tuple_element<N, cv::cudev::ZipPtr<tuple<Ptr0, Ptr1>>> : tuple_element<N, tuple<Ptr0, Ptr1>> {};

template<size_t N, class Ptr0, class Ptr1, class Ptr2>
struct tuple_element<N, cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2>>> : tuple_element<N, tuple<Ptr0, Ptr1, Ptr2>> {};

template<size_t N, class Ptr0, class Ptr1, class Ptr2, class Ptr3>
struct tuple_element<N, cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2, Ptr3>>> : tuple_element<N, tuple<Ptr0, Ptr1, Ptr2, Ptr3>> {};

_LIBCUDACXX_END_NAMESPACE_STD

Thrust does this for backwards compatibility with the old style of tuples as well. It also appears that tuple_element needs to be fixed as well, so I've included that for completeness.

In addition to the parameter packing changes mentioned above, I've successfully compiled OpenCV using this method.
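
The per-arity specializations above could probably be collapsed into one variadic specialization each. The sketch below shows that idea on the host with `std::tuple`, `std::tuple_size` and `std::tuple_element`; `ZipLike` is a hypothetical stand-in for `cv::cudev::ZipPtr`, and whether the same form works inside `_LIBCUDACXX_BEGIN_NAMESPACE_STD` with every supported toolkit is an assumption that has not been checked here.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for cv::cudev::ZipPtr: a wrapper holding a tuple of pointers.
template <typename PtrTuple>
struct ZipLike
{
    PtrTuple ptrs;
};

namespace std
{
// One variadic specialization covers wrapped tuples of every size.
template <typename... Ptrs>
struct tuple_size<ZipLike<tuple<Ptrs...>>> : tuple_size<tuple<Ptrs...>> {};

// Element lookup is forwarded to the wrapped tuple as well.
template <size_t N, typename... Ptrs>
struct tuple_element<N, ZipLike<tuple<Ptrs...>>> : tuple_element<N, tuple<Ptrs...>> {};
}

typedef ZipLike<std::tuple<int*, float*>> ZipOfTwo;

static_assert(std::tuple_size<ZipOfTwo>::value == 2,
              "size is forwarded from the wrapped tuple");
static_assert(std::is_same<std::tuple_element<1, ZipOfTwo>::type, float*>::value,
              "element type is forwarded from the wrapped tuple");
```

Compared with the generic `T<tuple<P...>>` partial specialization proposed earlier in the thread, this form only affects the named wrapper type, which keeps unrelated single-parameter templates from being treated as tuples.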

negril added a commit to negril/gentoo that referenced this issue Mar 31, 2024
Also limit cuda interaction to ABI_X86_64.

Bug: opencv/opencv_contrib#3690
Signed-off-by: Paul Zander <negril.nx+gentoo@gmail.com>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Apr 1, 2024
Also limit cuda interaction to ABI_X86_64.

Bug: opencv/opencv_contrib#3690
Signed-off-by: Paul Zander <negril.nx+gentoo@gmail.com>
Closes: #36020
Signed-off-by: Joonas Niilola <juippis@gentoo.org>
@miscco
miscco commented Apr 15, 2024

I am one of the maintainers of the cccl libraries at NVIDIA.

We recently updated our old thrust::tuple implementation to be an alias for cuda::std::tuple. Unfortunately, when providing the necessary backfills for thrust::tuple_size to work with thrust::null_type, someone (me) forgot to add the final overload for a 10-element tuple. My apologies for the disruption this bug has caused here.

This has been fixed after this issue was raised here.

There are different potential ways of working around this issue in the near / mid term:

  1. You could pull in latest cccl from github and use that instead of the version packaged with the CTK 12.4. CMake should be able to find ToT with find_package(CCCL)
  2. You could provide a temporary workaround similar to the fix we employed
  3. (mid term) You should remove all explicit specializations of tuple that rely on the old 10-param tuple and simply use the number of tuple elements you actually need, aka replace thrust::tuple<T1, T2, thrust::null_type,...> with thrust::tuple<T1, T2>
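
To illustrate how the padded form in option 3 relates to the real element count, here is a small host-side sketch. It is not the fix that was applied in cccl: `NullType`, `EffectiveSize` and `PaddedTupleSize` are hypothetical names, and `std::tuple` stands in for `thrust::tuple`. The trait counts the elements that come before the first padding type.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for thrust::null_type.
struct NullType {};

// Counts the types that precede the first NullType in a type list.
template <typename... T>
struct EffectiveSize;

// Empty list: nothing to count.
template <>
struct EffectiveSize<> : std::integral_constant<std::size_t, 0> {};

// Padding reached: stop counting.
template <typename... Rest>
struct EffectiveSize<NullType, Rest...> : std::integral_constant<std::size_t, 0> {};

// Real element: count it and continue with the rest.
template <typename Head, typename... Rest>
struct EffectiveSize<Head, Rest...>
    : std::integral_constant<std::size_t, 1 + EffectiveSize<Rest...>::value> {};

// Applies EffectiveSize to the element types of a tuple.
template <typename Tuple>
struct PaddedTupleSize;

template <typename... T>
struct PaddedTupleSize<std::tuple<T...>> : EffectiveSize<T...> {};

// With a variadic tuple, the padded and unpadded spellings are different types...
static_assert(!std::is_same<std::tuple<int, float, NullType>, std::tuple<int, float>>::value,
              "padding changes the tuple type");
// ...but both describe two real elements.
static_assert(PaddedTupleSize<std::tuple<int, float, NullType>>::value == 2, "padded form");
static_assert(PaddedTupleSize<std::tuple<int, float>>::value == 2, "unpadded form");
```

Because the two spellings are distinct types once the tuple is variadic, explicit specializations written against the padded form stop matching, which is why option 3 recommends spelling only the elements that are actually needed.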

@devops-golang
How do I replace it? Pull the repository and run CMake? With which CMake parameters?
When I run cmake .. it fails with the error: Could not find libcudacxx_LIT using the following names: lit

@miscco
miscco commented Apr 24, 2024

You could use CPM like:

include(cmake/CPM.cmake)
CPMAddPackage("gh:NVIDIA/cccl#main")
target_link_libraries(PROJECT CCCL::CCCL)

@jiapei100
I still don't quite get it. Do we have a solution already?

Should cccl be built and used in place of the version installed with CUDA Toolkit 12.4?

Thanks

@LiuToki
LiuToki commented May 7, 2024

I was able to build the library using CUDA Toolkit 12.3.2 in my environment (through vcpkg). That is one way to work around the issue.

Also, the above cccl fixes seem to be going into v2.4.0.
Even the latest version of the CUDA Toolkit at this time, 12.4.1, still seems to use an unfixed cccl.
Please look at the NVIDIA CUDA Toolkit Release Notes for 12.4 Update 1.

@asmorkalov

CUDA Toolkit 12.5 still has the bug.

@asmorkalov asmorkalov linked a pull request May 24, 2024 that will close this issue