linearizer: enable GROUP opts after TC #4408
I'm a little torn on this PR. I know it can help certain kernels, but it's making fast kernels just slightly faster at the cost of expanding the action space. It is a correct action though :/ Running a deeper beam to show potential performance gains and also a
updating the timing for master. given that this makes CUDA slower, it probably needs to be held behind a flag :(
can you isolate the reversing local_idxs order part?
fixes strides on group_for_reduce buffer to enable GROUP to be used correctly after TC. also reverses the local index order to be first ones on the left. fix tests that depend on local order and remove duplicated test
fyi tried this on latest master, does not seem to be faster with current beam settings
fixes strides on group_for_reduce buffer to enable GROUP to be used correctly after TC. also reverses the local index order to be first ones on the left, new LOCALs on the right before `self.first_reduce`.

these tests are versus master 2024-05-05 at 7094100

for tiny red / `HSA=1`:

for tiny green / `CUDA=1`:

*large beam is: `TRAIN_BEAM=8 IGNORE_JIT_FIRST_BEAM=1 BEAM_UOPS_MAX=2500 BEAM_UPCAST_MAX=128 BEAM_LOCAL_MAX=1024 BEAM_MIN_PROGRESS=5 BEAM_PADTO=0`
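
for anyone reading along, here is a toy sketch of the local ordering described above: earlier locals stay on the left, and each new LOCAL is placed on the right, immediately before the reduce axes. this is not the linearizer code. `add_local_right`, the axis names, and the plain-list representation are all hypothetical, and `first_reduce` here is just an integer index standing in for `self.first_reduce`.

```python
from typing import List, Tuple


def add_local_right(axes: List[str], first_reduce: int, new_axis: str) -> Tuple[List[str], int]:
    """Toy model only: axes[:first_reduce] are global/local axes and
    axes[first_reduce:] are reduce axes. The new local axis is inserted
    immediately before the reduce axes, so older locals stay on the left."""
    if not 0 <= first_reduce <= len(axes):
        raise ValueError("first_reduce out of range")
    new_axes = axes[:first_reduce] + [new_axis] + axes[first_reduce:]
    # The reduce axes moved one slot to the right.
    return new_axes, first_reduce + 1


# Hypothetical kernel with one global, one local, and one reduce axis.
axes, first_reduce = ["g0", "l0", "r0"], 2
axes, first_reduce = add_local_right(axes, first_reduce, "l1")
axes, first_reduce = add_local_right(axes, first_reduce, "l2")
print(axes, first_reduce)
```

under this model the locals read `l0, l1, l2` left to right in the order they were applied, and `first_reduce` still points at the first reduce axis.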