Kernel attributes and launchbounds__ Feature #1328

Jiuxiaoyunhai · 2024-01-24T04:52:35Z

Describe the motivation for the feature request

As shown below:
1. how do i map launch_bounds to AdaptiveCpp interfaces for HIP Code Optimization ?
2. It's enbaled to check registers and whose spilling for Code Optimization in CUDA and HIP, is this feature accessiable using AdaptiveCpp interfaces ?

__global__ __launch_bounds__(256, 1) 
void ReconstructFluxXShared(Real* UI, Real* LU, Real* eigen_local, Real*  rho, Real* u, Real* v, Real* w, Real*  H, Real*  p, Real const dx){}
    #if USE_HIP
    #define FuncAttributes hipFuncAttributes
    #elif USE_CUDA
    #define FuncAttributes cudaFuncAttributes
    #endif
    FuncAttributes attr;
    for(int i=0; i<10; i++)
        ReconstructFluxXShared<<<gridsize, blocksize>>>(d_U, d_FluxF, d_wallFluxF, d_eigen_local, d_fstate.rho, d_fstate.u, d_fstate.v, d_fstate.w, d_fstate.H, 0.01);
    // check register and shared mem
    printf(">>>>> Kernel name: ReconstructFluxXShared\n");
    #if USE_HIP
    hipFuncGetAttributes(&attr, (const void *)ReconstructFluxXShared);
    #elif USE_CUDA
    cudaFuncGetAttributes(&attr, ReconstructFluxXShared);
    #endif
    printf("Local mem usage: %d byte;          shared mem usage: %d bytes \n", (int)(attr.localSizeBytes), (int)(attr.sharedSizeBytes));
    printf("Constant mem usage: %d bytes;      register usage: %d \n", (int)(attr.constSizeBytes), attr.numRegs);
    printf("Max dynamic shared mem: %d bytes \n", attr.maxDynamicSharedSizeBytes);
    printf("Max threads per block: %d \n", attr.maxThreadsPerBlock);

The text was updated successfully, but these errors were encountered:

al42and · 2024-01-25T20:07:55Z

Related: #714

illuhad · 2024-01-25T22:39:32Z

For 1: This will be handled automatically in the generic SSCP compiler soonish. No need for user-provided hints.
For 2: This is not currently possible. What do you do with those values once you have obtained them?

Jiuxiaoyunhai · 2024-01-26T04:41:16Z

For 1: This will be handled automatically in the generic SSCP compiler soonish. No need for user-provided hints. For 2: This is not currently possible. What do you do with those values once you have obtained them?

I'm optimizing register comsumption of SYCL code for higher performance running on HIP Devices, GPUs' max registers allowed in a block usuallly achieves at 256*256. It's found that only the product threads of a block is an integer multiple of 256 coming to the theoretical performance peak of AMD Devices, that means the register comsumption of a kernel has to be less than 256.

al42and · 2024-01-26T11:28:38Z

This is not currently possible. What do you do with those values once you have obtained them?

I'm optimizing register comsumption of SYCL code for higher performance running on HIP Devices, GPUs' max registers allowed in a block usuallly achieves at 256*256. It's found that only the product threads of a block is an integer multiple of 256 coming to the theoretical performance peak of AMD Devices, that means the register comsumption of a kernel has to be less than 256.

Expanding on that, it could be a useful sanity check for SSCP. For known devices / multipass compilation, developers can manually check the register pressure at compile time (-save-temps + grep 'Num[SV]gprs\|ScratchSize' with HIP, -Xcuda-ptxas -v for CUDA), or at run time by using vendor profiling tool. That's no longer possible with JITting on user machines; and even on developer machines, one would have to actually run the application to see the register usage, not just build it. And while end-users won't be able to do much about register spills, at least it would result in useful bugreports, instead of just "code is slow on GPU X with software Y".

illuhad · 2024-01-26T14:31:40Z

This is a bit difficult to do.

In the non-SSCP case, the runtime does not actually see the HIP __global__ function, and there's currently no way to relay e.g. that function pointer back to the runtime.
In the SSCP case, the idea is that SSCP will become a massive autotuning platform, automatically generating increasingly specialized kernels at JIT time (also considering work group size as a tuning parameter). For this JIT needs to happen at a late stage, which prevents us from exposing these metrics in sycl::kernel_bundle which is what SYCL actually envisions for that. I suppose it might be possible to add a property to the submit which could allow users to relay a pointer to a struct to the runtime, which the runtime can then fill with these metrics at JIT time. Just for tuning group sizes though that should eventually not be needed as the autotuning framework should eventually do that by itself.

illuhad · 2024-02-09T15:29:02Z

__launch_bounds__ is addressed by PR #1347.

Jiuxiaoyunhai added the enhancement New feature or request label Jan 24, 2024

Jiuxiaoyunhai changed the title ~~Kernel attributes Feature~~ Kernel attributes and __launch__bounds__ Feature Jan 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kernel attributes and launchbounds__ Feature #1328

Kernel attributes and launchbounds__ Feature #1328

Jiuxiaoyunhai commented Jan 24, 2024 •

edited by fodinabor

al42and commented Jan 25, 2024

illuhad commented Jan 25, 2024

Jiuxiaoyunhai commented Jan 26, 2024 •

edited

al42and commented Jan 26, 2024 •

edited

illuhad commented Jan 26, 2024 •

edited

illuhad commented Feb 9, 2024

Kernel attributes and __launch__bounds__ Feature #1328

Kernel attributes and __launch__bounds__ Feature #1328

Comments

Jiuxiaoyunhai commented Jan 24, 2024 • edited by fodinabor

al42and commented Jan 25, 2024

illuhad commented Jan 25, 2024

Jiuxiaoyunhai commented Jan 26, 2024 • edited

al42and commented Jan 26, 2024 • edited

illuhad commented Jan 26, 2024 • edited

illuhad commented Feb 9, 2024

Kernel attributes and launchbounds__ Feature #1328

Kernel attributes and launchbounds__ Feature #1328

Jiuxiaoyunhai commented Jan 24, 2024 •

edited by fodinabor

Jiuxiaoyunhai commented Jan 26, 2024 •

edited

al42and commented Jan 26, 2024 •

edited

illuhad commented Jan 26, 2024 •

edited