Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multireduce Kernels: Allow IF blocks to terminate early #4457

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

0xtimmy
Copy link
Contributor

@0xtimmy 0xtimmy commented May 6, 2024

this pr modifies UOpGraph.add_ends() so that it doesn't double add ENDIFs inserted by the linearizer
this will let the linearizer terminate if blocks early when it has to render another reduceop

ex. in standard deviation

__kernel void r_128_16_4n3(__global float* data0, const __global float* data1) {
  ...                                                 // first reduceop
  temp[lidx1] = acc0;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (alu1) {
    float acc1 = 0.0f;
    for (int ridx1 = 0; ridx1 < 16; ridx1++) {
      float val1 = temp[ridx1];
      acc1 = (val1+acc1);
    }
    temp[0] = acc1;
  }                                                   // terminate the if block early
  barrier(CLK_LOCAL_MEM_FENCE);                       // put the ENDIF in the vin of this barrier
  float val2 = temp[0];
  float acc2 = 0.0f;
  for (int ridx2 = 0; ridx2 < 4; ridx2++) {           // second reduceop
    float val3 = data1[alu0+ridx2];
    acc2 = ((val3-(val2*0.015625f))+acc2);
  }
  temp[lidx1] = acc2;
  ...
}

the linearizer will have to also put the ENDIF in a vin so it doesn't get optimized away by UOpGraph.remove_childless(). I think putting it in the vin of the barrier makes the most sense

@0xtimmy 0xtimmy marked this pull request as ready for review May 6, 2024 20:34
@Qazalin
Copy link
Collaborator

Qazalin commented May 6, 2024

This diff suggests:

  1. Linearizer is now dealing with ENDIF
  2. The specification of a UOp's vins changes

Is there a way to isolate this behavior in def add_ends(self):?

@0xtimmy
Copy link
Contributor Author

0xtimmy commented May 6, 2024

i can experiment with it, do you want me to move to draft in the meantime?

@0xtimmy 0xtimmy marked this pull request as draft May 7, 2024 23:33
@0xtimmy
Copy link
Contributor Author

0xtimmy commented May 8, 2024

so, since IF statements aren't dependencies like loop we will have to change something in the linearizer. but we will have to change the linearizer anyways to allow the results of one reduceop to be loaded back into every thread for a potential next reduceop.

general pattern is:

IF
...
ENDLOOP
STORE
ENDIF
BARRIER
LOAD
  1. we could add the IF to the vin of the endif or the barrier to signal the end of the if block but that would be changing the spec of the vin as well

  2. we could look for this pattern explicitly, or at least the IF-STORE-BARRIER-LOAD; it is ingenuine to the use of IFs in general, but it's at least more representative than simply tacking ENDIF onto the end of the kernel

I've drafted both of these systems locally b/c they aren't that complex:

  1. works if we add the IF to the BARRIER because like a STORE, the BARRIER has to come after the ENDIF. STORE has a more meaningful vin spec so modifying it is def a bad idea
  2. just works

@Qazalin
Copy link
Collaborator

Qazalin commented May 8, 2024

UOps always form a graph.
I think we can reason about this as each consecutive barrier depends on the previous (BARRIER -> IF) block. I'm more interested in a diff that leverages IF/prev BARRIER instead of manually inserting ENDIF.

It's in the design of UOps.END* to be childless/graph breaking.
We have to refactor IF/BARRIERS since we need to UOp more than one of them, I think we need a vin change somewhere.

Copy link
Contributor

github-actions bot commented May 8, 2024

This branch currently is behind tinygrad/master. The line count difference bot is disabled.

@Qazalin
Copy link
Collaborator

Qazalin commented May 8, 2024

nice, is this ready to test end-to-end? Can you merge it with #4259

@0xtimmy
Copy link
Contributor Author

0xtimmy commented May 8, 2024

I can, I like this more: the IF is used as a barrier for the local global_load; so instead of just finding the next BARRIER, this finds the BARRIER used after storing the PHI of the local reduce

seems a bit more robust?

@0xtimmy
Copy link
Contributor Author

0xtimmy commented May 8, 2024

https://github.com/tinygrad/tinygrad/pull/4259/files#r1594119777

regardless it will need this chunk of code to test end-to-end.Changes have been made (like getting rid of the explicit ENDIF) so I can push those and then put the tests on that pr

@Qazalin
Copy link
Collaborator

Qazalin commented May 9, 2024

@0xtimmy Can you make it works with the BARRIER one 18641c9

There are two merge blockers for this:

  1. Can you update Multireduce-Kernels: Linearizer Changes and Tests #4259 with this diff?
  2. test_early_endif doesn't test the new behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants