Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linearizer: enable GROUP opts after TC #4408

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

flammit
Copy link
Contributor

@flammit flammit commented May 3, 2024

fixes strides on group_for_reduce buffer to enable GROUP to be used correctly after TC. also reverses the local index order to be first ones on the left, new LOCALs on the right before self.first_reduce.

these tests are versus master 2024-05-05 at 7094100

for tiny red / HSA=1:

branch settings setup (mm:ss) step time (ms) flops (TF)
master default 25:30 319ms 119TF
master *large beam 47:59 308ms 123TF
enable_group_after_tc default 25:52 314ms 121TF
enable_group_after_tc *large beam 48:02 308ms 123TF

for tiny green / CUDA=1:

branch settings setup (mm:ss) step time (ms) flops (TF)
master default 27:56 229ms 166TF
master *large beam 76:23 223ms 170TF
enable_group_after_tc default 26:35 233ms 163TF
enable_group_after_tc *large beam 78:37 227ms 167TF

*large beam is: TRAIN_BEAM=8 IGNORE_JIT_FIRST_BEAM=1 BEAM_UOPS_MAX=2500 BEAM_UPCAST_MAX=128 BEAM_LOCAL_MAX=1024 BEAM_MIN_PROGRESS=5 BEAM_PADTO=0

@flammit flammit marked this pull request as draft May 3, 2024 19:49
@flammit flammit force-pushed the enable_group_after_tc branch 2 times, most recently from 8fb694c to 57fe812 Compare May 3, 2024 20:48
@flammit flammit marked this pull request as ready for review May 3, 2024 22:00
@flammit flammit changed the title [DRAFT] linearizer: enable GROUP opts after TC linearizer: enable GROUP opts after TC May 3, 2024
@flammit
Copy link
Contributor Author

flammit commented May 3, 2024

I'm a little torn on this PR. I know it can help certain kernels, but it's making fast kernels just slightly faster at the cost of expanding the action space. It is a correct action though :/

Running a deeper beam to show potential performance gains and also a DEPTH=4 FUZZ_ALL_ACTIONS=1 FUZZ_N=1024 FUZZ_REQUIRE_TC=1 METAL=1 with fuzz_linearizer.

@flammit flammit marked this pull request as draft May 3, 2024 22:04
@flammit flammit force-pushed the enable_group_after_tc branch 4 times, most recently from cdf40ab to 29091bc Compare May 6, 2024 02:33
@flammit
Copy link
Contributor Author

flammit commented May 6, 2024

updating the timing for master. given that this makes CUDA slower, it probably needs to be held behind a flag :(

@flammit flammit marked this pull request as ready for review May 6, 2024 03:24
@chenyuxyz
Copy link
Collaborator

can you isolate the reversing local_idxs order part?

@flammit flammit marked this pull request as draft May 6, 2024 15:54
@flammit flammit force-pushed the enable_group_after_tc branch 4 times, most recently from 71a1471 to dd798c3 Compare May 7, 2024 20:15
fixes strides on group_for_reduce buffer to enable GROUP to be
used correctly after TC.  also reverses the local index order to
be first ones on the left.

fix tests that depend on local order and remove duplicated test
Copy link
Contributor

github-actions bot commented May 9, 2024

Changes

Name                              Lines    Diff    Tokens/Line    Diff
------------------------------  -------  ------  -------------  ------
tinygrad/codegen/kernel.py          463      +0           18.4    +0.1
tinygrad/codegen/linearizer.py      328      +3           19.6    +0.2


total lines changes: +3

@chenyuxyz
Copy link
Collaborator

fyi tried this on latest master, does not seem to be faster with current beam settings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants