
Add support for dynamic offsets to DefaultEpilogue #1274

Open · wants to merge 1 commit into base: main

Conversation

ezhulenev (Contributor)

Dynamic offsets in `DefaultEpilogue` allow moving pointer arithmetic to the device, shifting the `C` and `D` pointers based on offsets stored in device memory.

Depends on #1273
hwu36 (Collaborator) commented Dec 19, 2023

@thakkarV

@ezhulenev, what is the use case of this?

ezhulenev (Contributor, Author)

In XLA, inside loops (and inside control flow in general) we keep buffer offsets on device. This, for example, allows us to put two GEMMs writing at different offsets computed at run time into different `If` branches and capture both of them into a single CUDA graph (using the conditional graph nodes added in CUDA 12.3). Without dynamic offsets we would be forced to move the offset value to the host and build multiple CUDA graphs.

hwu36 (Collaborator) commented Dec 19, 2023

> In XLA inside loops (and in general inside control flow) we keep buffer offsets on device, this for example allows to put two gemms writing at different offsets calculated at run time into different If branches and capture both of them into single cuda graph (with conditional graphs added in cuda 12.3). Without dynamic offsets we would be forced to move offset value to host and build multiple cuda graphs.

how does XLA handle this now when not using cutlass?

ezhulenev (Contributor, Author)

Well… it doesn't; that's why I'm looking at adding CUTLASS :) XLA does handle it for non-GEMM computations by compiling its own kernels, but for cuBLAS, for example, we are forced to materialize temporary buffers at known offsets, and the overhead adds up.

hwu36 (Collaborator) commented Dec 19, 2023

gotcha. thanks.

ezhulenev (Contributor, Author)

I'm also considering keeping this in XLA as a template specialization, since it is a little too XLA-specific (especially the `int32_t` offsets; in general `int64_t` makes more sense, but it is harder to target from XLA).

hwu36 (Collaborator) commented Dec 20, 2023

@kadeng , does torch have this need?

kadeng commented Dec 20, 2023

> @kadeng, does torch have this need?

Not at this moment, but the argument about improving CUDA graph reusability appears compelling.

ezhulenev (Contributor, Author)

I implemented this inside XLA with template specializations here: openxla/xla#7916, so I don't need it in CUTLASS right now. But in general I think it would be very useful if dynamic offsets could work with epilogues and also with inputs (I didn't look at how to make it work with TMA), and making the feature more generic and less XLA-focused is worthwhile. Mostly because of CUDA graphs: they are getting more powerful with every CUDA release, and with on-device control flow this is really handy.

thakkarV (Collaborator) commented Jan 8, 2024

Hello! Before we go ahead with accepting this MR for the default epi, I wanted to ask some questions about its generality to some other epilogues we have. Default epi is what we call a direct store epilogue, which uses no shared memory, and therefore cannot swizzle its output stores, leading to suboptimal perf. Additionally it does not support fusions via EVT. This epi was designed as a vanilla epilogue to aid in development of mainloops and is mostly a debugging tool rather than a zippy fusion+store API.

We recommend using the TMA EVT epilogue on SM90 for best perf, or the sm70 vectorized epi on non-TMA architectures via the 3.x API. If we were to accept this MR, would you see it used in production workloads despite its suboptimal perf, or shall we discuss ways to generalize this to all of our epilogues including the performant ones?

ezhulenev (Contributor, Author)

This is a general feature that we'd need for both inputs and outputs (epilogues): we know the "base" address at run time when we prepare TMA descriptors (when they are initialized inside CUTLASS from arguments), but the real addresses of the input/output buffers can depend on offsets computed on device (strides are known ahead of time).

In the extreme case we can always set the "base" address to 0 (nullptr) and reuse the same CUDA graph with a CUTLASS kernel for all problems of the same shape at different memory locations.

github-actions bot commented Feb 7, 2024

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions bot commented May 7, 2024

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

4 participants