Multireduce-Kernels: Linearizer Changes and Tests #4259

Open · wants to merge 79 commits into master

Conversation

@0xtimmy (Contributor) commented Apr 23, 2024

This is the linearizer stage of the multireduce kernels PR.

It contains changes to linearizer.py, kernel.py, and uops.py to allow them to produce kernels with multiple reductions.
Test cases have been added to test_linearizer.py that check a few things:

  • simple fusion of reduceops
  • fusion with intermediate calculations (both at the output shape and full shape)
  • fusion with multiple outputs (outputs must be the same shape)
  • re: @Qazalin, I didn't add your tests in directly because this PR doesn't change the scheduler, but I did put in one case that uses the same AST. There were two issues: 1) the ordering of reduceops, and 2) intermediate values would sometimes get rendered within a for loop. I fixed both.

The main changes to linearizer.py touch up spots that depend on there being only one reduceop, for example storing the result of a local reduce in a temp buffer so it can be loaded back into every thread.
I also changed kernel.py so that the "deeper" reduceops are rendered first.
And I had to change uops.py so that if statements can be closed before the end of the kernel.
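
For illustration (my example, not code from the PR), the kind of expression whose AST contains more than one ReduceOp is something like a variance, where the result of one reduce is consumed by a second one; whether the scheduler actually fuses this into a single kernel is outside the scope of this PR:

from tinygrad.tensor import Tensor

x = Tensor.rand(16, 64)
mean = x.mean(axis=1, keepdim=True)        # first reduce
var = ((x - mean) ** 2).sum(axis=1) / 64   # second reduce consuming the first
var.realize()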

reduceops = [x for x in self.lazyops if x.op in ReduceOps]
assert len(dedup(reduceops)) <= 1, "max one reduce op in an ast"
self.reduceop = reduceops[0] if reduceops else None
def get_rchildren(x:LazyOp): return ([x] if x.op in ReduceOps else []) + [r for s in x.src for r in get_rchildren(s)]

@Qazalin (Collaborator) commented Apr 23, 2024

import time
from tinygrad.codegen.linearizer import Linearizer
from tinygrad.engine.schedule import create_schedule
from tinygrad.tensor import Tensor

st = time.perf_counter()
a = Tensor([1,2,3,4])
for _ in range(24): a = a + a  # both sources of every add are the same op, so a naive recursion walks 2**24 paths
r = a.sum()
ast = create_schedule([r.lazydata])[-1].ast
lin = Linearizer(*ast)
assert time.perf_counter()-st < 1.0

you should keep track of visited nodes.
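
One way to do that (a minimal sketch, assuming LazyOp and ReduceOps are importable from tinygrad.ops and that LazyOp is hashable; not the exact code in this PR) is to memoize the traversal so shared subtrees are visited once instead of once per path:

from typing import List, Optional, Set
from tinygrad.ops import LazyOp, ReduceOps

def get_rchildren(x:LazyOp, visited:Optional[Set[LazyOp]]=None) -> List[LazyOp]:
  # collect every ReduceOp below x, skipping nodes we have already seen
  if visited is None: visited = set()
  if x in visited: return []
  visited.add(x)
  return ([x] if x.op in ReduceOps else []) + [r for s in x.src for r in get_rchildren(s, visited)]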

@0xtimmy (Contributor, Author) commented Apr 24, 2024

I futzed around with it and the traversal just seems to be too slow in general, so instead of reordering the reduceops when the kernel gets initialized, I render them in ast_parse, since ast_parse is recursing through the AST anyway.

I did have to make some changes to uops.py, so I'm adding some more tests to make sure things still linearize properly.

0xtimmy marked this pull request as ready for review April 25, 2024 04:43

@Qazalin (Collaborator) commented Apr 26, 2024

This diff is very hard to review; can you break it down into isolated PRs?

@Qazalin (Collaborator) commented May 1, 2024

@0xtimmy can you provide a status update?
What are the blockers to getting the linearizer changes merged?

@0xtimmy (Contributor, Author) commented May 1, 2024

It's finals week; I can probably have a PR up tomorrow.

@0xtimmy (Contributor, Author) commented May 3, 2024

In regards to @chenyuxyz's comment about using insert_before, I came up with two additional solutions (a rough sketch of the placeholder idea follows the list):

  1. we can "chunk" each reduceop: keep rendering them in ast_parse and just place the chunk of new uops before any existing loop context
  2. we can render all the reduceops out of order with some placeholder for each accumulator, then move them into the proper order and switch out the placeholder during the main ast_parse

Just some ideas I had; I can put up PRs once they're tested.
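
To make idea 2 concrete, here is a rough hypothetical sketch of the placeholder-accumulator approach in plain Python; FakeUOp, PLACEHOLDER, render_reduce_chunk, and splice are made-up names for illustration and are not tinygrad's UOp API:

from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class FakeUOp:          # stand-in for a rendered uop, illustration only
  op: str
  arg: object = None

PLACEHOLDER = FakeUOp("ACC_PLACEHOLDER")

def render_reduce_chunk(rid:int) -> List[FakeUOp]:
  # render one reduceop out of order, referring to its accumulator via the placeholder
  return [PLACEHOLDER, FakeUOp("LOAD", rid), FakeUOp("ALU", rid), FakeUOp("PHI", rid)]

def splice(chunks:Dict[int, List[FakeUOp]], order:List[int]) -> List[FakeUOp]:
  # move the chunks into the proper order and swap each placeholder for its real accumulator
  out: List[FakeUOp] = []
  for rid in order:
    acc = FakeUOp("DEFINE_ACC", rid)
    out += [acc if u == PLACEHOLDER else u for u in chunks[rid]]
  return out

chunks = {1: render_reduce_chunk(1), 0: render_reduce_chunk(0)}  # rendered out of order
print(splice(chunks, order=[0, 1]))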

@chaosagent (Contributor) left a comment

small things

reduceops = [x for x in self.lazyops if x.op in ReduceOps]
assert len(dedup(reduceops)) <= 1, "max one reduce op in an ast"
self.reduceop = reduceops[0] if reduceops else None
self.reduceops = list(set([x for x in self.lazyops if x.op in ReduceOps]))

Contributor

we have tinygrad.helpers.dedup
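
For context, tinygrad.helpers.dedup is (as I remember it) an order-preserving de-duplication, which avoids the nondeterministic ordering that list(set(...)) can introduce; a quick sketch of the suggestion:

from tinygrad.helpers import dedup

assert dedup([3, 1, 3, 2, 1]) == [3, 1, 2]  # keeps first-seen order, unlike set()
# so the diff above could become:
# self.reduceops = dedup([x for x in self.lazyops if x.op in ReduceOps])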

ret.reduceop, ret.outbufs, ret.vars, ret.bufs, ret.earlybufs, ret.full_buf_index = \
self.reduceop, self.outbufs, self.vars, [x for x in self.bufs if not isinstance(x, LocalBuffer)], self.earlybufs, self.full_buf_index
ret.reduceops, ret.outbufs, ret.vars, ret.bufs, ret.earlybufs, ret.full_buf_index = \
self.reduceops, self.outbufs, self.vars, [x for x in self.bufs if not isinstance(x, LocalBuffer)], self.earlybufs, self.full_buf_index

Contributor

ret.reduceops = self.reduceops[:]

Contributor

nevermind, reduceops is not mutated

@0xtimmy (Contributor, Author) commented May 20, 2024

> uops.py got a rewrite fyi - I think this diff requires some changes. The tests are almost there, can you create a separate PR for getting the tests upstreamed? I'd add tests for 2+ back-to-back reduces as well. We need a robust way for testing loop scopes.

I made a separate PR for the tests here: #4665

@Qazalin (Collaborator) commented May 21, 2024

wanna delete the skips too?

@@ -390,21 +401,20 @@ def linearize(self):

# parse AST
loaded_buffers:Dict[Union[MemBuffer, ConstBuffer, LocalBuffer], List[UOp]] = {}
acc: List[UOp] = []
accs: Dict[LazyOp, List[UOp]] = {}

Collaborator

can you factor out this accs dict refactor to its own PR?

Contributor Author

yes sir

@0xtimmy (Contributor, Author) commented May 21, 2024

> wanna delete the skips too?

But this would mean that the assert len(reduceops) <= 1 has to go as well, or at least get moved.

@Qazalin (Collaborator) commented May 21, 2024

yeah you can remove it, the diff supports multiple reduces.

@0xtimmy (Contributor, Author) commented May 21, 2024

The failing tests seem to pass on the tinybox when run with GPU=1 and AMD=1, but I will look into it more.

@Qazalin (Collaborator) commented May 21, 2024

you can skip AMD CI for now - this is how we do it: https://github.com/tinygrad/tinygrad/blob/master/test/test_linearizer.py#L433
I'd look at GPU=1 though
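
For reference, the usual unittest pattern for this kind of device-conditional skip (a hedged sketch; the class name, test name, and message are placeholders, and the linked test may differ in detail) looks like:

import unittest
from tinygrad import Device

class TestMultireduceExample(unittest.TestCase):
  @unittest.skipIf(Device.DEFAULT == "AMD", "flaky on AMD CI for now")
  def test_two_back_to_back_reduces(self):
    pass  # placeholder body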

@0xtimmy (Contributor, Author) commented May 21, 2024

I messed around with GPU=1 a bit and I can't seem to get the issue to reproduce on the tinybox or my machine, but I do know how to fix them, re: 4c96836, which gives each reduceop its own local buffer.

The downside of this is that it might take up more scratch memory, but the upside is that we'll probably have to do it anyway for parallel reduction.

def test_upcast_multireduce_nested_local_upcast(self):
x, y, z, w = [Tensor.rand(1,128).realize() for _ in range(4)]
x, y, z, w = [Tensor.rand(1,128,128).realize() for _ in range(4)]

Collaborator

oh I think this should've been:

x, y = Tensor.rand(1,128), Tensor.rand(128, 128)
z, w = Tensor.rand(1,128), Tensor.rand(128, 128)

Contributor Author

yeah, this is what worked for me: #4682 (comment)

@@ -100,10 +100,11 @@ def global_load(self, i:int, idxs:List[Node], acc:Optional[LazyOp]=None, barrier
for idx, valid, rep_idx in zip(e_idxs, e_valids, iter_idxs(expand_vars)):
this_const, idx, valid = (invalid_value, NumNode(0), NumNode(1)) if valid.max == 0 else (const, idx, valid)
# todo: when multiple reduceops are supported, clearly disambiguate and test acc load keys are unique for each reduceop

Collaborator

can remove this todo

fake_reduce_idxs = [x*0 for x in reduce_idxs]

# add a local buffer for multistage reduce. # TODO: use local alias

@Qazalin (Collaborator) May 23, 2024

Contributor Author

Yeah, so for sequential reduceops you should be able to just reuse the same buffer, which is why for a while this block of code was in the linearize function.

However, it seems to misbehave on GPU: e.g. the CI for this commit versus the next (which actually makes each local buffer unique).

I'm honestly not sure why, because I can't get the bug to reproduce on my machine or the tinybox, but I could trace the issue to the local buffers, and giving each reduceop its own buffer fixed it.

I don't think it has anything to do with the earlybufs.

Collaborator

Contributor Author

So that it gets run for every reduceop.

We could keep it in 'linearize' and give it its own for loop, but it seemed more natural to put it in 'render_reduceop' so we can keep using '-1' to reference the local_buffer in stuff like 'self.sts' or 'global_store'.

@@ -426,10 +436,9 @@ def ast_parse(self, x:LazyOp, accs: Dict[LazyOp, List[UOp]], offs:Optional[List[
if x.op in BufferOps: return loaded_buffers[x.arg]
if x.op in [UnaryOps.CAST, UnaryOps.BITCAST]:
return [self.uops.add(UOps.BITCAST if x.op is UnaryOps.BITCAST else UOps.CAST,
self.get_base_dtype(x.arg), (u,)) for u in self.ast_parse(x.src[0], accs, offs, loaded_buffers)]
self.get_base_dtype(x.arg), (u,)) for u in self.ast_parse(x.src[0], accs, offs, loaded_buffers)]

Collaborator

is this diff extra?

Contributor Author

I think so, I can remove it

Contributor

Changes

Name                              Lines    Diff    Tokens/Line    Diff
------------------------------  -------  ------  -------------  ------
tinygrad/codegen/kernel.py          444      -1           17.3    +0.0
tinygrad/codegen/linearizer.py      354     +10           19.1    -0.2
tinygrad/codegen/uops.py            334      +3           17.3    +0.2


total lines changes: +12
