Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel attributes and __launch__bounds__ Feature #1328

Open
Jiuxiaoyunhai opened this issue Jan 24, 2024 · 6 comments
Open

Kernel attributes and __launch__bounds__ Feature #1328

Jiuxiaoyunhai opened this issue Jan 24, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@Jiuxiaoyunhai
Copy link

Jiuxiaoyunhai commented Jan 24, 2024

Describe the motivation for the feature request

As shown below:
1. how do i map launch_bounds to AdaptiveCpp interfaces for HIP Code Optimization ?
2. It's enbaled to check registers and whose spilling for Code Optimization in CUDA and HIP, is this feature accessiable using AdaptiveCpp interfaces ?

__global__ __launch_bounds__(256, 1) 
void ReconstructFluxXShared(Real* UI, Real* LU, Real* eigen_local, Real*  rho, Real* u, Real* v, Real* w, Real*  H, Real*  p, Real const dx){}
    #if USE_HIP
    #define FuncAttributes hipFuncAttributes
    #elif USE_CUDA
    #define FuncAttributes cudaFuncAttributes
    #endif
    FuncAttributes attr;
    for(int i=0; i<10; i++)
        ReconstructFluxXShared<<<gridsize, blocksize>>>(d_U, d_FluxF, d_wallFluxF, d_eigen_local, d_fstate.rho, d_fstate.u, d_fstate.v, d_fstate.w, d_fstate.H, 0.01);
    // check register and shared mem
    printf(">>>>> Kernel name: ReconstructFluxXShared\n");
    #if USE_HIP
    hipFuncGetAttributes(&attr, (const void *)ReconstructFluxXShared);
    #elif USE_CUDA
    cudaFuncGetAttributes(&attr, ReconstructFluxXShared);
    #endif
    printf("Local mem usage: %d byte;          shared mem usage: %d bytes \n", (int)(attr.localSizeBytes), (int)(attr.sharedSizeBytes));
    printf("Constant mem usage: %d bytes;      register usage: %d \n", (int)(attr.constSizeBytes), attr.numRegs);
    printf("Max dynamic shared mem: %d bytes \n", attr.maxDynamicSharedSizeBytes);
    printf("Max threads per block: %d \n", attr.maxThreadsPerBlock);


@Jiuxiaoyunhai Jiuxiaoyunhai added the enhancement New feature or request label Jan 24, 2024
@Jiuxiaoyunhai Jiuxiaoyunhai changed the title Kernel attributes Feature Kernel attributes and __launch__bounds__ Feature Jan 24, 2024
@al42and
Copy link
Contributor

al42and commented Jan 25, 2024

Related: #714

@illuhad
Copy link
Collaborator

illuhad commented Jan 25, 2024

For 1: This will be handled automatically in the generic SSCP compiler soonish. No need for user-provided hints.
For 2: This is not currently possible. What do you do with those values once you have obtained them?

@Jiuxiaoyunhai
Copy link
Author

Jiuxiaoyunhai commented Jan 26, 2024

For 1: This will be handled automatically in the generic SSCP compiler soonish. No need for user-provided hints. For 2: This is not currently possible. What do you do with those values once you have obtained them?

I'm optimizing register comsumption of SYCL code for higher performance running on HIP Devices, GPUs' max registers allowed in a block usuallly achieves at 256*256. It's found that only the product threads of a block is an integer multiple of 256 coming to the theoretical performance peak of AMD Devices, that means the register comsumption of a kernel has to be less than 256.

@al42and
Copy link
Contributor

al42and commented Jan 26, 2024

This is not currently possible. What do you do with those values once you have obtained them?

I'm optimizing register comsumption of SYCL code for higher performance running on HIP Devices, GPUs' max registers allowed in a block usuallly achieves at 256*256. It's found that only the product threads of a block is an integer multiple of 256 coming to the theoretical performance peak of AMD Devices, that means the register comsumption of a kernel has to be less than 256.

Expanding on that, it could be a useful sanity check for SSCP. For known devices / multipass compilation, developers can manually check the register pressure at compile time (-save-temps + grep 'Num[SV]gprs\|ScratchSize' with HIP, -Xcuda-ptxas -v for CUDA), or at run time by using vendor profiling tool. That's no longer possible with JITting on user machines; and even on developer machines, one would have to actually run the application to see the register usage, not just build it. And while end-users won't be able to do much about register spills, at least it would result in useful bugreports, instead of just "code is slow on GPU X with software Y".

@illuhad
Copy link
Collaborator

illuhad commented Jan 26, 2024

This is a bit difficult to do.

  • In the non-SSCP case, the runtime does not actually see the HIP __global__ function, and there's currently no way to relay e.g. that function pointer back to the runtime.
  • In the SSCP case, the idea is that SSCP will become a massive autotuning platform, automatically generating increasingly specialized kernels at JIT time (also considering work group size as a tuning parameter). For this JIT needs to happen at a late stage, which prevents us from exposing these metrics in sycl::kernel_bundle which is what SYCL actually envisions for that. I suppose it might be possible to add a property to the submit which could allow users to relay a pointer to a struct to the runtime, which the runtime can then fill with these metrics at JIT time. Just for tuning group sizes though that should eventually not be needed as the autotuning framework should eventually do that by itself.

@illuhad
Copy link
Collaborator

illuhad commented Feb 9, 2024

__launch_bounds__ is addressed by PR #1347.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants