Inconsistent results with CUDA backend #1408

Open
blinkfrog opened this issue Mar 16, 2024 · 11 comments
Labels
discussion: General discussion about something
windows: Issues related to using AdaptiveCpp on Windows (which is experimental!)

Comments

@blinkfrog
Contributor

blinkfrog commented Mar 16, 2024

Hello,

I've been working with AdaptiveCpp and oneAPI, targeting multiple backends: CPU and CUDA through AdaptiveCpp, as well as CPU (via OpenCL) and Level Zero through oneAPI. Of all these configurations, only the CUDA backend via AdaptiveCpp shows a peculiar issue: about 30% of runs produce incorrect results. The failure rate rises to 100% with a simplified reproducer program that I've written for demonstration purposes.

One of the algorithms in the application I am working on constructs and deconstructs an image pyramid. When executing this process on the CUDA backend, I observe incorrect data, as shown in the sample output below. Notably, the issue does not manifest on the CPU backend or with oneAPI's alternatives.

The reproducer I've written constructs and deconstructs an image pyramid, and while doing so reads the data at the (0, 0) coordinates of each image level and prints it to the console.

This is the (correct) output when using the CPU backend:

Running on device: hipSYCL OpenMP host device

Initializing data with 1.0f and constructing image pyramid

Data at 0 level: 1.000000
Data at 1 level: 1.000000
Data at 2 level: 1.000000
Data at 3 level: 1.000000

Changing data to 2.0f and deconstructing image pyramid

Data at 3 level: 2.000000
Data at 2 level: 2.000000
Data at 1 level: 2.000000
Data at 0 level: 2.000000

And this is the (incorrect) output when using the CUDA backend:

Running on device: NVIDIA GeForce RTX 3080 Laptop GPU

Initializing data with 1.0f and constructing image pyramid

Data at 0 level: 1.000000
Data at 1 level: 0.000000
Data at 2 level: 0.000000
Data at 3 level: 0.000000

Changing data to 2.0f and deconstructing image pyramid

Data at 3 level: 2.000000
Data at 2 level: 0.000000
Data at 1 level: 0.000000
Data at 0 level: 0.000000

I've tried adding explicit synchronization points using q.wait() at various places in the program, but this didn't help.
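
For illustration, this is roughly the pattern I tried around the downsampling step (a sketch with hypothetical wait placement, not the exact code):

sycl::buffer<float, 2> downsample(sycl::buffer<float, 2>& in, sycl::queue& q)
{
    q.wait(); // ensure all previously submitted work on `in` has finished
    sycl::buffer<float, 2> out{ in.get_range() / 2 };
    q.submit([&](sycl::handler& h)
    {
        sycl::accessor data_out(out, h, sycl::write_only, sycl::no_init);
        sycl::accessor data_in(in, h, sycl::read_only);
        h.parallel_for(out.get_range(), [=](auto it)
        {
            data_out[it] = data_in[sycl::id(it[1] * 2, it[0] * 2)];
        });
    });
    q.wait(); // ensure the downsampling kernel itself has finished
    return out;
}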

Thank you in advance for your assistance and for the great work on AdaptiveCpp.

Code:

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
#include <string>
#include <memory> // for std::unique_ptr / std::make_unique

const size_t resolution = 1024;
const size_t level_count = 4;

class ImageLevel
{
private:
    sycl::queue& q;

    ImageLevel(sycl::buffer<float, 2> data, sycl::queue& q, ImageLevel* previous_level)
        : q(q), previous_level(previous_level), data(data)
    {
    }

public:
    ImageLevel* previous_level = nullptr;
    sycl::buffer<float, 2> data;


    ImageLevel(sycl::buffer<float, 2> data, sycl::queue& q)
        : q(q),
        data(data)
    {
    }

    std::unique_ptr<ImageLevel> create_subsequent_level()
    {
        return std::make_unique<ImageLevel>(ImageLevel(downsample(data, q), q, this));
    }

    void reconstruct()
    {
        q.submit([&](sycl::handler& h)
        {
            sycl::accessor data_previous(previous_level->data, h, sycl::write_only);
            sycl::accessor data_current(data, h, sycl::read_only);
            h.parallel_for(previous_level->data.get_range(), [=](auto it)
            {
                auto x = it[0] / 2;
                auto y = it[1] / 2;
                data_previous[it] = data_current[sycl::id(y, x)];
            });
        });
    }

private:
    sycl::buffer<float, 2> downsample(sycl::buffer<float, 2>& in, sycl::queue& q)
    {
        sycl::buffer<float, 2> out{ in.get_range() / 2 };
        q.submit([&](sycl::handler& h)
        {
            sycl::accessor data_out(out, h, sycl::write_only, sycl::no_init);
            sycl::accessor data_in(in, h, sycl::read_only);
            h.parallel_for(out.get_range(), [=](auto it)
            {
                auto x = it[0] * 2;
                auto y = it[1] * 2;
                data_out[it] = data_in[sycl::id(y, x)];
            });
        });
        return out;
    }
};


int main(int argc, char* argv[])
{
    try
    {
        auto d_selector{ sycl::gpu_selector_v };
        sycl::queue q(d_selector);
        std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n\n";
        sycl::buffer<float, 2> data(sycl::range{ resolution, resolution });

        std::cout << "Initializing data with 1.0f and constructing image pyramid\n\n";

        q.submit([&](sycl::handler& h)
        {
            sycl::accessor data_accessor(data, h, sycl::write_only, sycl::no_init);
            h.fill(data_accessor, 1.0f); // initializing original data with 1.0f;
        });

        auto root_level = ImageLevel(data, q);

        {
            sycl::host_accessor dh(root_level.data, sycl::read_only);
            std::cout << "Data at 0 level: " << std::to_string(dh[0][0]) << "\n";
        }

        auto current_level = &root_level;
        std::vector<std::unique_ptr<ImageLevel>> pyramid;
        for (int level = 1; level < level_count; level++)
        {
            pyramid.push_back(current_level->create_subsequent_level());
            current_level = pyramid.back().get();
            sycl::host_accessor dh(current_level->data, sycl::read_only);
            std::cout << "Data at " << level << " level: " << std::to_string(dh[0][0]) << "\n";
        }

        std::cout << "\nChanging data to 2.0f and deconstructing image pyramid\n\n";

        q.submit([&](sycl::handler& h)
        {
            sycl::accessor data_accessor(current_level->data, h, sycl::write_only);
            h.fill(data_accessor, 2.0f); // initializing top-level data with 2.0f;
        });

        {
            sycl::host_accessor dh(current_level->data, sycl::read_only);
            std::cout << "Data at " << level_count - 1 << " level: " << std::to_string(dh[0][0]) << "\n";
        }

        for (int level = level_count - 1; level > 0; level--)
        {
            current_level->reconstruct();
            current_level = current_level->previous_level;
            sycl::host_accessor dh(current_level->data, sycl::read_only);
            std::cout << "Data at " << level - 1 << " level: " << std::to_string(dh[0][0]) << "\n";
        }
    }
    catch (sycl::exception const& e)
    {
        std::cout << "An exception is caught: " << e.what();
        std::terminate();
    }

    return 0;
}
@blinkfrog blinkfrog added the discussion General discussion about something label Mar 16, 2024
@blinkfrog
Contributor Author

While investigating this problem today, I've noticed that, when using the CUDA backend in AdaptiveCpp, the out buffer is always zero-filled at the end of the downsample function, regardless of the in buffer's content (the CPU backend works just fine).

Also, it is still a mystery why a similar algorithm works in most cases in the bigger application. Maybe this is because that app uses a more complex algorithm which runs slower? Maybe it is some sort of synchronization problem? Although I tried adding explicit synchronization points using q.wait() at both the beginning and the end of this function, and this didn't help.

@illuhad
Collaborator

illuhad commented Mar 18, 2024

Hi, I cannot reproduce this with --acpp-targets=generic on an RTX A5000:

$ ACPP_VISIBILITY_MASK="omp;cuda" ./test
Running on device: NVIDIA RTX A5000

Initializing data with 1.0f and constructing image pyramid

'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
[AdaptiveCpp Warning] kernel_cache: This application run has resulted in new binaries being JIT-compiled. This indicates that the runtime optimization process has not yet reached peak performance. You may want to run the application again until this warning no longer appears to achieve optimal performance.
Data at 0 level: 1.000000
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
Data at 1 level: 1.000000
Data at 2 level: 1.000000
Data at 3 level: 1.000000

Changing data to 2.0f and deconstructing image pyramid

Data at 3 level: 2.000000
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
Data at 2 level: 2.000000
Data at 1 level: 2.000000
Data at 0 level: 2.000000

Please provide more information: AdaptiveCpp version, how you have built things, the compilation flow you are using, and the output of a run with ACPP_DEBUG_LEVEL=3.

EDIT: Note that one important difference between the CPU backend and all other backends (including CUDA) is that the CPU backend will directly operate on the host-side allocation of the buffer, so there will be no data copies when you use a host_accessor.
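
To illustrate the point, a minimal sketch (not from the reproducer): on the OpenMP host device the host_accessor read below touches the buffer's host allocation in place, whereas on CUDA the runtime must first flush the DAG and perform a device-to-host copy before host access can begin.

#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    sycl::queue q{ sycl::gpu_selector_v }; // or the OpenMP host device
    sycl::buffer<float, 2> buf{ sycl::range{ 16, 16 } };

    q.submit([&](sycl::handler& h)
    {
        sycl::accessor acc(buf, h, sycl::write_only, sycl::no_init);
        h.fill(acc, 1.0f);
    });

    {
        // On the OpenMP host device this reads the buffer's host allocation
        // in place; on CUDA a device-to-host memcpy must run first.
        sycl::host_accessor ha(buf, sycl::read_only);
        std::cout << ha[0][0] << "\n";
    }
}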

@blinkfrog
Contributor Author

Thank you very much for your response and for taking the time to look into this issue.

The version of AdaptiveCpp I am currently using is the latest, v24, with this commit: 5777309 (HEAD -> develop, origin/develop, origin/HEAD). Initially, I encountered this error while using v23.

  • Clang 17.0.4
  • CUDA 12.4 (the error was originally encountered on version 12.2)

I use the standard SMCP CUDA compilation flow, as I am using Windows.

This is the CMakeLists.txt for this test:

cmake_minimum_required(VERSION 3.5)
project(PyramidTest)

set(AdaptiveCpp_DIR "C:/AdaptiveCpp/lib/cmake/AdaptiveCpp")
set(ACPP_TARGETS "omp;cuda:sm_86")

set(CMAKE_C_COMPILER "C:/llvm/bin/clang.exe")
set(CMAKE_CXX_COMPILER "C:/llvm/bin/clang++.exe")

find_package(AdaptiveCpp CONFIG REQUIRED)

include_directories(${PROJECT_BINARY_DIR} ${PROJECT_SOURCE_DIR})

add_executable(PyramidTest PyramidTest.cpp)
add_sycl_to_target(TARGET PyramidTest SOURCES PyramidTest.cpp)

target_compile_options(PyramidTest PRIVATE
    -O3
)

Here is the full output with ACPP_DEBUG_LEVEL=3:

[AdaptiveCpp Info] kernel_cache: Registering kernel class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
[AdaptiveCpp Info] kernel_cache: Registering kernel class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
[AdaptiveCpp Info] kernel_cache: Registering kernel class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
[AdaptiveCpp Info] backend_loader: Searching path for backend libs: '"C:\\adaptivecpp\\bin\\hipSYCL"'
[AdaptiveCpp Info] backend_loader: Successfully opened plugin: "C:\\adaptivecpp\\bin\\hipSYCL\\rt-backend-cuda.dll" for backend 'cuda'
[AdaptiveCpp Warning] backend_loader: Could not load library: C:\adaptivecpp\bin\hipSYCL\rt-backend-hip.dll with: 126
[AdaptiveCpp Info] backend_loader: Successfully opened plugin: "C:\\adaptivecpp\\bin\\hipSYCL\\rt-backend-omp.dll" for backend 'omp'
[AdaptiveCpp Info] Registering backend: 'cuda'...
[AdaptiveCpp Info] Registering backend: 'omp'...
[AdaptiveCpp Info] Discovered devices from backend 'CUDA':
[AdaptiveCpp Info]   device 0:
[AdaptiveCpp Info]     vendor: NVIDIA
[AdaptiveCpp Info]     name: NVIDIA GeForce RTX 3080 Laptop GPU
[AdaptiveCpp Info] Discovered devices from backend 'OpenMP':
[AdaptiveCpp Info]   device 0:
[AdaptiveCpp Info]     vendor: the hipSYCL project
[AdaptiveCpp Info]     name: hipSYCL OpenMP host device
[AdaptiveCpp Info] dag_manager: DAG manager is alive!
[AdaptiveCpp Info] runtime: ******* rt launch initiated ********
[AdaptiveCpp Info] queue: Constructed queue with node group id 1
Running on device: NVIDIA GeForce RTX 3080 Laptop GPU

[AdaptiveCpp Info] data_region: constructed with page table dimensions 1 1 1
Initializing data with 1.0f and constructing image pyramid

[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238B9F2E8D0
[AdaptiveCpp Info]     --> requires node @00000238B9F2E580 MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238B9F2EA70
[AdaptiveCpp Info]     --> requires node @00000238B9F2E8D0 kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] 2. MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238B9F2E580
[AdaptiveCpp Info] dag_manager: waiting for async worker...
[AdaptiveCpp Info] dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238B8262220
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4 to 0000000B09800000
[AdaptiveCpp Info] multi_queue_executor: Spawned for backend CUDA with configuration:
[AdaptiveCpp Info]   device 0:
[AdaptiveCpp Info]     memcpy lane: 0
[AdaptiveCpp Info]     memcpy lane: 1
[AdaptiveCpp Info]     kernel lane: 2
[AdaptiveCpp Info]     kernel lane: 3
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2E8D0 with 0 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 2: kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2E8D0 with 0 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0DF0: kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238B8318060
[AdaptiveCpp Info] Identified embedded pointer with uid 2339541634286677568-2017620333385067274 in kernel blob, setting to 0000000B09800000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 1024, 1024} #4 to 00000238C07B4048
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2EA70 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 1024, 1024}-->CPU-Device0 #4 {0, 0, 0}+{1, 1024, 1024}{1, 1024, 1024}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2EA70 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238B9F2E8D0
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EBFBF0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 1024, 1024}-->CPU-Device0 #4 {0, 0, 0}+{1, 1024, 1024}{1, 1024, 1024}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 0 level: 1.000000
[AdaptiveCpp Info] data_region: constructed with page table dimensions 1 1 1
[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238C6FCD840
[AdaptiveCpp Info]     --> requires node @00000238C6BA47B0 MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info]     --> requires node @00000238B9F2EC10 MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 512, 512} #4 @node 00000238C6FCD9E0
[AdaptiveCpp Info]     --> requires node @00000238C6FCD840 kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] 2. MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4 @node 00000238C6BA47B0
[AdaptiveCpp Info] 3. MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238B9F2EC10
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info]
dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8B30
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238B8262220
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4 to 0000000B09E00000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4 to 0000000B09800000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238C6FCD840 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 3: kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238C6FCD840 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0F70: kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7DF90
[AdaptiveCpp Info] Identified embedded pointer with uid 5512066146663722016-10088244587580991754 in kernel blob, setting to 0000000B09E00000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7D310
[AdaptiveCpp Info] Identified embedded pointer with uid 7961992457126735700-10448541353930726666 in kernel blob, setting to 0000000B09800000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 512, 512} #4 to 00000238CF918048
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238C6FCD9E0 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 1: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 512, 512}-->CPU-Device0 #4 {0, 0, 0}+{1, 512, 512}{1, 512, 512}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238C6FCD9E0 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238C6FCD840
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC1670: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 512, 512}-->CPU-Device0 #4 {0, 0, 0}+{1, 512, 512}{1, 512, 512}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 1 level: 0.000000
[AdaptiveCpp Info] data_region: constructed with page table dimensions 1 1 1
[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4 @node 00000238C7012000
[AdaptiveCpp Info]     --> requires node @00000238B9F2E760 MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info]     --> requires node @00000238B9F2E900 MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 256, 256} #4 @node 00000238C70121A0
[AdaptiveCpp Info]     --> requires node @00000238C7012000 kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] 2. MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4 @node 00000238B9F2E760
[AdaptiveCpp Info] 3. MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4 @node 00000238B9F2E900
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info]
dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8C70
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8B30
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4 to 0000000B09F00000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4 to 0000000B09E00000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238C7012000 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 2: kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238C7012000 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0DF0: kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7DDB0
[AdaptiveCpp Info] Identified embedded pointer with uid 14087086638589750516-14843902856901533962 in kernel blob, setting to 0000000B09F00000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7D9F0
[AdaptiveCpp Info] Identified embedded pointer with uid 6809362000239585020-16933587375555440394 in kernel blob, setting to 0000000B09E00000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 256, 256} #4 to 00000238C6FD1E58
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238C70121A0 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 256, 256}-->CPU-Device0 #4 {0, 0, 0}+{1, 256, 256}{1, 256, 256}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238C70121A0 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238C7012000
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EBFBF0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 256, 256}-->CPU-Device0 #4 {0, 0, 0}+{1, 256, 256}{1, 256, 256}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 2 level: 0.000000
[AdaptiveCpp Info] data_region: constructed with page table dimensions 1 1 1
[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4 @node 00000238B9F2EAA0
[AdaptiveCpp Info]     --> requires node @00000238C6BA47B0 MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info]     --> requires node @00000238C6FCD9E0 MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 128, 128} #4 @node 00000238B9F2EC40
[AdaptiveCpp Info]     --> requires node @00000238B9F2EAA0 kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] 2. MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4 @node 00000238C6BA47B0
[AdaptiveCpp Info] 3. MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4 @node 00000238C6FCD9E0
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info] dag_manager [async]: Flushing!

[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED89F0
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8C70
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4 to 0000000B09F40000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4 to 0000000B09F00000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2EAA0 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 3: kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2EAA0 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0F70: kernel: class `public: <auto> __cdecl `private: class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl ImageLevel::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 128, 128} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7E3F0
[AdaptiveCpp Info] Identified embedded pointer with uid 684550788686641508-9007368582400901130 in kernel blob, setting to 0000000B09F40000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7E350
[AdaptiveCpp Info] Identified embedded pointer with uid 6088901125608948656-288241374103695882 in kernel blob, setting to 0000000B09F00000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 128, 128} #4 to 00000238C7012338
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2EC40 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 1: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 128, 128}-->CPU-Device0 #4 {0, 0, 0}+{1, 128, 128}{1, 128, 128}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2EC40 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238B9F2EAA0
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC1670: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 128, 128}-->CPU-Device0 #4 {0, 0, 0}+{1, 128, 128}{1, 128, 128}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 3 level: 0.000000

Changing data to 2.0f and deconstructing image pyramid

[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4 @node 00000238B9F2E760
[AdaptiveCpp Info]     --> requires node @00000238C70121A0 MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 128, 128} #4 @node 00000238B9F2E900
[AdaptiveCpp Info]     --> requires node @00000238B9F2E760 kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] 2. MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4 @node 00000238C70121A0
[AdaptiveCpp Info] [AdaptiveCpp Info] dag_manager [async]: Flushing!dag_manager: waiting for async worker...
[AdaptiveCpp Info]
dag_manager [async]: Releasing dead users of data region 00000238C6ED89F0
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4 to 0000000B09F40000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2E760 with 0 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 2: kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2E760 with 0 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0DF0: kernel: class hipsycl::sycl::detail::kernels::fill_kernel<float,2,1,0,5>
   MEM_REQ: W device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7E530
[AdaptiveCpp Info] Identified embedded pointer with uid 16537421174570766940-2233997622329552906 in kernel blob, setting to 0000000B09F40000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 128, 128} #4 to 00000238C7012338
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2E900 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 128, 128}-->CPU-Device0 #4 {0, 0, 0}+{1, 128, 128}{1, 128, 128}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2E900 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238B9F2E760
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EBFBF0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 128, 128}-->CPU-Device0 #4 {0, 0, 0}+{1, 128, 128}{1, 128, 128}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 3 level: 2.000000
[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4 @node 00000238C6FCD9E0
[AdaptiveCpp Info]     --> requires node @00000238C6BA47B0 MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info]     --> requires node @00000238B9F2EC40 MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 256, 256} #4 @node 00000238B9F2EAA0
[AdaptiveCpp Info]     --> requires node @00000238C6FCD9E0 kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] 2. MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4 @node 00000238C6BA47B0
[AdaptiveCpp Info] 3. MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4 @node 00000238B9F2EC40
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info] dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region
00000238C6ED8C70
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED89F0
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4 to 0000000B09F00000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4 to 0000000B09F40000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238C6FCD9E0 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 3: kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238C6FCD9E0 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0F70: kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 256, 256} #4
   MEM_REQ: R device {0, 0, 0}+{1, 128, 128} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7C910
[AdaptiveCpp Info] Identified embedded pointer with uid 468731050684875564-16501219823058249994 in kernel blob, setting to 0000000B09F00000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7D1D0
[AdaptiveCpp Info] Identified embedded pointer with uid 13799425530405593564-12322090477121000202 in kernel blob, setting to 0000000B09F40000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 256, 256} #4 to 00000238C6FD1E58
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2EAA0 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 1: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 256, 256}-->CPU-Device0 #4 {0, 0, 0}+{1, 256, 256}{1, 256, 256}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2EAA0 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238C6FCD9E0
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC1670: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 256, 256}-->CPU-Device0 #4 {0, 0, 0}+{1, 256, 256}{1, 256, 256}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 2 level: 0.000000
[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4 @node 00000238B9F2E760
[AdaptiveCpp Info]     --> requires node @00000238C70121A0 MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info]     --> requires node @00000238C6FCD540 MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 512, 512} #4 @node 00000238B9F2E900
[AdaptiveCpp Info]     --> requires node @00000238B9F2E760 kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] 2. MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4 @node 00000238C70121A0
[AdaptiveCpp Info] 3. MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4 @node 00000238C6FCD540
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info]
dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8B30
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8C70
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4 to 0000000B09E00000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4 to 0000000B09F00000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2E760 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 2: kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2E760 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0DF0: kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 256, 256} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7E350
[AdaptiveCpp Info] Identified embedded pointer with uid 12358486574225520040-11457364162884489994 in kernel blob, setting to 0000000B09E00000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7D9F0
[AdaptiveCpp Info] Identified embedded pointer with uid 5945564112202970656-17294016084156180746 in kernel blob, setting to 0000000B09F00000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 512, 512} #4 to 00000238CF918048
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2E900 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 512, 512}-->CPU-Device0 #4 {0, 0, 0}+{1, 512, 512}{1, 512, 512}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2E900 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238B9F2E760
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EBFBF0: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 512, 512}-->CPU-Device0 #4 {0, 0, 0}+{1, 512, 512}{1, 512, 512}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 1 level: 0.000000
[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Initializing host access
[AdaptiveCpp Info] accessor [host]: forcing DAG flush for host access...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4 @node 00000238B9F2EAA0
[AdaptiveCpp Info]     --> requires node @00000238C6BA47B0 MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info]     --> requires node @00000238C7011E80 MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] 1. MEM_REQ: R host_buffer {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238B9F2EC40
[AdaptiveCpp Info]     --> requires node @00000238B9F2EAA0 kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] 2. MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4 @node 00000238C6BA47B0
[AdaptiveCpp Info] 3. MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4 @node 00000238C7011E80
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info]
dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238B8262220
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 00000238C6ED8B30
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4 to 0000000B09800000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4 to 0000000B09E00000
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2EAA0 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 3: kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2EAA0 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC0F70: kernel: class `public: <auto> __cdecl `public: void __cdecl ImageLevel::reconstruct(void) __ptr64'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: W device {0, 0, 0}+{1, 1024, 1024} #4
   MEM_REQ: R device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7C410
[AdaptiveCpp Info] Identified embedded pointer with uid 17186417538373074732-16789455695459464714 in kernel blob, setting to 0000000B09800000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 00000238C6F7C870
[AdaptiveCpp Info] Identified embedded pointer with uid 14304187443179069760-6485463842251147274 in kernel blob, setting to 0000000B09E00000
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R host_buffer {0, 0, 0}+{1, 1024, 1024} #4 to 00000238C07B4048
[AdaptiveCpp Info] multi_queue_executor: Processing node 00000238B9F2EC40 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 1: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 1024, 1024}-->CPU-Device0 #4 {0, 0, 0}+{1, 1024, 1024}{1, 1024, 1024}
[AdaptiveCpp Info] inorder_executor: Processing node 00000238B9F2EC40 with 1 non-virtual requirement(s) and 1 direct requirement(s).
[AdaptiveCpp Info]  --> Synchronizes with other queue for node: 00000238B9F2EAA0
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 00000238C2EC1670: Memcpy: CUDA-Device0 #4 {0, 0, 0}+{1, 1024, 1024}-->CPU-Device0 #4 {0, 0, 0}+{1, 1024, 1024}{1, 1024, 1024}
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] accessor [host]: Waiting for completion of host access...
Data at 0 level: 0.000000
[AdaptiveCpp Info] buffer_impl::~buffer_impl: Waiting for operations to complete...
[AdaptiveCpp Info] buffer_impl::~buffer_impl: Waiting for operations to complete...
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 00000238C6FD1E58
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 0000000B09F00000
[AdaptiveCpp Info] buffer_impl::~buffer_impl: Waiting for operations to complete...
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 00000238C7012338
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 0000000B09F40000
[AdaptiveCpp Info] buffer_impl::~buffer_impl: Waiting for operations to complete...
[AdaptiveCpp Info] runtime: ******* rt shutdown ********
[AdaptiveCpp Info] dag_manager: Waiting for async worker...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_manager: Nothing to do
[AdaptiveCpp Info] dag_manager: waiting for async worker...[AdaptiveCpp Info]
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation dag_manager: Shutdown.00000238C07B4048

[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 0000000B09800000
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 00000238CF918048
[AdaptiveCpp Info] data_region::~data_region: Freeing allocation 0000000B09E00000

@illuhad illuhad added the windows Issues related to using AdaptiveCpp on Windows (which is experimental!) label Mar 18, 2024
@illuhad
Collaborator

illuhad commented Mar 18, 2024

Thank you. I also cannot reproduce with --acpp-targets=cuda:

$ ACPP_VISIBILITY_MASK="omp;cuda" ./test
Running on device: NVIDIA RTX A5000

Initializing data with 1.0f and constructing image pyramid

Data at 0 level: 1.000000
Data at 1 level: 1.000000
Data at 2 level: 1.000000
Data at 3 level: 1.000000

Changing data to 2.0f and deconstructing image pyramid

Data at 3 level: 2.000000
Data at 2 level: 2.000000
Data at 1 level: 2.000000
Data at 0 level: 2.000000

It may well be something that can only be reproduced on Windows, as things are pretty experimental there throughout.
As I don't have a Windows development machine (and I'm not really knowledgeable when it comes to Windows development), my own ability to reproduce this is limited. But I've added the Windows label to this issue in case someone with more expertise in this area sees it :)

Nothing immediately stands out to me in your output. One thing that could help is narrowing down the issue further: if there is some UB on our side that does not show up on Linux, the shorter your reproducer is, the easier it will be to spot potential issues.

@blinkfrog
Contributor Author

Thank you. I've tried to narrow this issue down further, as you suggested, and found that the problem occurs when the downsample function is a member function of a class, no matter whether it is static or not. When it is a global function, everything works well.

This is the new code that I am testing:

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
#include <string>

const size_t resolution = 1024;

class Downsampler
{
public:
    static sycl::buffer<float, 2> downsample(sycl::buffer<float, 2>& in, sycl::queue& q)
    {
        sycl::buffer<float, 2> out{ in.get_range() / 2 };
        q.submit([&](sycl::handler& h)
        {
            sycl::accessor data_out(out, h, sycl::write_only, sycl::no_init);
            sycl::accessor data_in(in, h, sycl::read_only);
            h.parallel_for(out.get_range(), [=](auto it)
            {
                auto x = it[0] * 2;
                auto y = it[1] * 2;
                data_out[it] = data_in[sycl::id(y, x)];
            });
        });
        return out;
    }
};

sycl::buffer<float, 2> downsample(sycl::buffer<float, 2>& in, sycl::queue& q)
{
    sycl::buffer<float, 2> out{ in.get_range() / 2 };
    q.submit([&](sycl::handler& h)
    {
        sycl::accessor data_out(out, h, sycl::write_only, sycl::no_init);
        sycl::accessor data_in(in, h, sycl::read_only);
        h.parallel_for(out.get_range(), [=](auto it)
        {
            auto x = it[0] * 2;
            auto y = it[1] * 2;
            data_out[it] = data_in[sycl::id(y, x)];
        });
    });
    return out;
}

int main(int argc, char* argv[])
{
    try
    {
        auto d_selector{ sycl::gpu_selector_v };
        sycl::queue q(d_selector);
        std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n\n";
        sycl::buffer<float, 2> data(sycl::range{ resolution, resolution });
        
        q.submit([&](sycl::handler& h)
        {
            sycl::accessor data_accessor(data, h, sycl::write_only, sycl::no_init);
            h.fill(data_accessor, 1.0f); // initializing original data with 1.0f;
        });

        {
            sycl::host_accessor dh(data, sycl::read_only);
            std::cout << "Original data: " << std::to_string(dh[0][0]) << "\n";
        }

        sycl::buffer<float, 2> downsampled_data = downsample(data, q);
        {
            sycl::host_accessor dh(downsampled_data, sycl::read_only);
            std::cout << "Downsampled data (global function): " << std::to_string(dh[0][0]) << "\n";
        }

        sycl::buffer<float, 2> downsampled_data2 = Downsampler::downsample(data, q);
        {
            sycl::host_accessor dh(downsampled_data2, sycl::read_only);
            std::cout << "Downsampled data (class member): " << std::to_string(dh[0][0]) << "\n";
        }

    }
    catch (sycl::exception const& e)
    {
        std::cout << "An exception is caught: " << e.what();
        std::terminate();
    }

    return 0;
}

This is what I get (should be 1.0 everywhere):

Running on device: NVIDIA GeForce RTX 3080 Laptop GPU

Original data: 1.000000
Downsampled data (global function): 1.000000
Downsampled data (class member): 0.000000

This is quite strange, since I have used member functions before without problems: the project where I encountered this issue stores its main buffers in a class and processes them using member functions, and I have never noticed any problems with those long-lived buffers.

I am wondering whether this issue could be related to the fact that I use CUDA 12.4 (or, until recently, 12.2). When I compile any AdaptiveCpp project, I get this warning:

C:\Temp\sycltest\PyramidTest\out\build\x64-Release\clang : warning : CUDA version is newer than the latest partially supported version 12.1 [-Wunknown-cuda-version]
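
For context, this diagnostic comes from clang itself: it fires whenever the installed CUDA toolkit is newer than the newest version the compiler has been validated against. It is only a warning, but toolkit mismatches like this can in principle lead to miscompilation. If you just want to silence it while testing, clang accepts -Wno-unknown-cuda-version; a hypothetical invocation, assuming acpp forwards unrecognized flags to the underlying clang:

acpp --acpp-targets=cuda -Wno-unknown-cuda-version -o test.exe test.cpp

Note that this only hides the warning; it does not make CUDA 12.4 a supported toolkit version.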

@blinkfrog
Contributor Author

And this issue isn't related to how buffers are passed and returned. If I add data output (using a host accessor) right into the body of the member downsample function, I get the same incorrect results (correct in, incorrect out).

So something doesn't work right in the body of that function.
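
For reference, a minimal sketch of the kind of in-function check described above (hypothetical; the exact print statements were not posted), inserted just before the return in Downsampler::downsample:

// hypothetical check placed just before "return out;" in the member
// downsample function; constructing a host_accessor waits for the kernel
{
    sycl::host_accessor in_h(in, sycl::read_only);   // prints 1.000000 (correct)
    sycl::host_accessor out_h(out, sycl::read_only); // prints 0.000000 (incorrect) on CUDA/Release
    std::cout << "in(0,0):  " << std::to_string(in_h[0][0]) << "\n";
    std::cout << "out(0,0): " << std::to_string(out_h[0][0]) << "\n";
}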

This is the debug log of what occurs there:

[AdaptiveCpp Info] hiplike_kernel_launcher: Submitting high-level parallel for with selected total group size of 256
[AdaptiveCpp Info] dag_manager: Checking DAG flush opportunity...
[AdaptiveCpp Info] dag_manager: Submitting asynchronous flush...
[AdaptiveCpp Info] dag_builder: DAG contains operations:
[AdaptiveCpp Info] 0. kernel: class `public: <auto> __cdecl `public: static class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl Downsampler::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64)'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4 @node 000001DB28DA8050
[AdaptiveCpp Info]     --> requires node @000001DB35CD8530 MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
[AdaptiveCpp Info]     --> requires node @000001DB35CD8120 MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] 1. MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4 @node 000001DB35CD8530
[AdaptiveCpp Info] 2. MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4 @node 000001DB35CD8120
[AdaptiveCpp Info] dag_manager: waiting for async worker...
[AdaptiveCpp Info] dag_manager [async]: Flushing!
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 000001DB35BE4740
[AdaptiveCpp Info] dag_manager [async]: Releasing dead users of data region 000001DB26DE9B80
[AdaptiveCpp Info] dag_manager [async]: Submitting node to scheduler!
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4 to 0000000B09F00000
[AdaptiveCpp Info] dag_direct_scheduler: Setting device data pointer of requirement node MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4 to 0000000B09800000
[AdaptiveCpp Info] multi_queue_executor: Processing node 000001DB28DA8050 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] multi_queue_executor: Dispatching to lane 2: kernel: class `public: <auto> __cdecl `public: static class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl Downsampler::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64)'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] inorder_executor: Processing node 000001DB28DA8050 with 0 non-virtual requirement(s) and 2 direct requirement(s).
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 000001DB31BC79F0: kernel: class `public: <auto> __cdecl `public: static class hipsycl::sycl::buffer<float,2,class std::allocator<float> > __cdecl Downsampler::downsample(class hipsycl::sycl::buffer<float,2,class std::allocator<float> > & __ptr64,class hipsycl::sycl::queue & __ptr64)'::`1'::<lambda_1>::operator()(class hipsycl::sycl::handler & __ptr64)const __ptr64'::`1'::<lambda_1>
   MEM_REQ: Discard W device {0, 0, 0}+{1, 512, 512} #4
   MEM_REQ: R device {0, 0, 0}+{1, 1024, 1024} #4
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 000001DB35C8ED90
[AdaptiveCpp Info] Identified embedded pointer with uid 823317228869893804-1297259896214424020 in kernel blob, setting to 0000000B09F00000
[AdaptiveCpp Info] buffer_memory_requirement: Attempting to initialize embedded pointers for requirement 000001DB35C8E7F0
[AdaptiveCpp Info] Identified embedded pointer with uid 5290992514806124580-18158588467825141972 in kernel blob, setting to 0000000B09800000
[AdaptiveCpp Info] dag_manager [async]: DAG flush complete.

@illuhad
Collaborator

illuhad commented Mar 19, 2024

I don't know what the options on Windows are - can you use cuda-memcheck or similar tooling to check whether there is something incorrect going on in that kernel, or with that kernel launch?
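
(Side note: with CUDA 12.x, the standalone cuda-memcheck binary has been superseded by compute-sanitizer, which ships with the toolkit on Windows as well. A typical invocation against the reproducer, with test.exe as a placeholder for the binary name:

compute-sanitizer --tool memcheck test.exe

memcheck is the default tool; racecheck and synccheck may also be worth a run here.)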

@blinkfrog
Contributor Author

Thanks for the suggestion, I've got some interesting information!

I've tried to use the Nsight VSE debugger, which is integrated into VS. I first tried to compile my Release build with -g added, but, strangely, no .pdb file was generated (I will investigate this). So I switched to RelWithDebInfo, got the symbols, and was able to step into the kernels. But then everything worked fine!

So, to sum up: with a RelWithDebInfo build the issue does not occur, but if I switch to Release, it does. Changing the -O optimization level makes no difference to this behavior in either Release or RelWithDebInfo.
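
For anyone trying to reproduce both configurations, a sketch of the corresponding out-of-source CMake invocations (directory names are placeholders; with a single-config generator such as Ninja, the configuration is fixed at configure time):

cmake -S . -B build-rel -DCMAKE_BUILD_TYPE=Release
cmake -S . -B build-rwdi -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build-rel    # exhibits the wrong results on the CUDA backend
cmake --build build-rwdi   # produces correct results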

@nilsfriess
Collaborator

How have you built LLVM and AdaptiveCpp? I vaguely remember that the optimisation levels of LLVM and AdaptiveCpp have to match on Windows; maybe this is a similar problem?

@blinkfrog
Contributor Author

blinkfrog commented Mar 19, 2024

How have you built LLVM and AdaptiveCpp? I vaguely remember that the optimisation levels of LLVM and AdaptiveCpp have to match on Windows; maybe this is a similar problem?

I compiled them both using the Release configuration, with no -O flag set explicitly (the default is -O2?).
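
As an aside, for a GNU-style clang driver, CMake's Release configuration typically defaults to -O3 -DNDEBUG rather than -O2; the flags actually cached for a given build directory (name is a placeholder) can be checked with:

cmake -LA -N build-rel | findstr CMAKE_CXX_FLAGS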

@blinkfrog
Contributor Author

Some additional, probably related, information. I now compile all projects with the RelWithDebInfo configuration, and they work just fine with CUDA and the other backends. However, I decided to check again whether they would work with the Release configuration, and, no, I encounter problems. Another project (which I won't be able to reduce, it is too large), which works without problems when compiled with RelWithDebInfo, throws asynchronous SYCL exceptions:

[18:18:14] ERR Caught asynchronous SYCL exception:
from D:/source/AdaptiveCpp/src/runtime/cuda/cuda_allocator.cpp:48 @ allocate(): cuda_allocator: cudaMalloc() failed (error code = CUDA:700)

[18:18:14] ERR Caught asynchronous SYCL exception:
from D:/source/AdaptiveCpp/src/runtime/dag_direct_scheduler.cpp:113 @ ensure_allocation_exists(): dag_direct_scheduler: Lazy memory allocation has failed.

Interestingly, I have synchronous SYCL exception handlers as well, but they catch nothing.
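
Incidentally, CUDA error code 700 is cudaErrorIllegalAddress, i.e. a kernel performed an illegal memory access, which fits the miscompilation hypothesis. That only the asynchronous handlers fire is expected: such failures surface after submission has already returned, so they can only be delivered through a queue's async_handler at a synchronization point. A minimal sketch of wiring one up, following the SYCL 2020 API:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Asynchronous errors are collected by the runtime and handed to this
    // callback at synchronization points; a try/catch around submit()
    // never sees them.
    auto handler = [](sycl::exception_list exceptions) {
        for (const std::exception_ptr& e : exceptions) {
            try {
                std::rethrow_exception(e);
            } catch (const sycl::exception& ex) {
                std::cout << "Caught asynchronous SYCL exception: " << ex.what() << "\n";
            }
        }
    };
    sycl::queue q{sycl::gpu_selector_v, handler};
    // ... submit kernels ...
    q.wait_and_throw(); // the handler is invoked here for accumulated errors
    return 0;
}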
