Multireduce-Kernels: Linearizer Changes and Tests #4259

Open · wants to merge 79 commits into master

Conversation

@0xtimmy (Contributor) commented Apr 23, 2024

This is the linearizer stage of the multireduce kernels PR.

It contains changes to linearizer.py, kernel.py, and uops.py to allow them to produce kernels with multiple reductions.
Test cases have been added to test_linearizer.py that check a few things:

  • simple fusion of reduceops
  • fusion with intermediate calculations (both at the output shape and full shape)
  • fusion with multiple outputs (outputs must be the same shape)
  • re: @Qazalin, I didn't add your tests in directly because this PR doesn't change the scheduler, but I did put in one case that uses the same AST. There were two issues: 1) the ordering of reduceops, and 2) intermediate values would sometimes get rendered within a for loop. I fixed both.

The main changes to linearizer.py touch up spots that depend on there being only one reduceop, for example storing the result of a local reduce in a temp buffer so it can be loaded back into every thread.
I also changed kernel.py so that the "deeper" reduceops are rendered first.
And I had to change uops.py so that if statements can be closed before the end of the kernel.
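
For illustration (my example, not code from the PR), the kind of expression whose AST contains more than one ReduceOp is something like a variance, where the result of one reduce is consumed by a second one; whether the scheduler actually fuses this into a single kernel is outside the scope of this PR:

from tinygrad.tensor import Tensor

x = Tensor.rand(16, 64)
mean = x.mean(axis=1, keepdim=True)        # first reduce
var = ((x - mean) ** 2).sum(axis=1) / 64   # second reduce consuming the first
var.realize()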

reduceops = [x for x in self.lazyops if x.op in ReduceOps]
assert len(dedup(reduceops)) <= 1, "max one reduce op in an ast"
self.reduceop = reduceops[0] if reduceops else None
def get_rchildren(x:LazyOp): return ([x] if x.op in ReduceOps else []) + [r for s in x.src for r in get_rchildren(s)]

@Qazalin (Collaborator) commented Apr 23, 2024

import time
from tinygrad.codegen.linearizer import Linearizer
from tinygrad.engine.schedule import create_schedule
from tinygrad.tensor import Tensor

st = time.perf_counter()
a = Tensor([1,2,3,4])
for _ in range(24): a = a + a  # both sources of every add are the same op, so a naive recursion walks 2**24 paths
r = a.sum()
ast = create_schedule([r.lazydata])[-1].ast
lin = Linearizer(*ast)
assert time.perf_counter()-st < 1.0

you should keep track of visited nodes.
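
One way to do that (a minimal sketch, assuming LazyOp and ReduceOps are importable from tinygrad.ops and that LazyOp is hashable; not the exact code in this PR) is to memoize the traversal so shared subtrees are visited once instead of once per path:

from typing import List, Optional, Set
from tinygrad.ops import LazyOp, ReduceOps

def get_rchildren(x:LazyOp, visited:Optional[Set[LazyOp]]=None) -> List[LazyOp]:
  # collect every ReduceOp below x, skipping nodes we have already seen
  if visited is None: visited = set()
  if x in visited: return []
  visited.add(x)
  return ([x] if x.op in ReduceOps else []) + [r for s in x.src for r in get_rchildren(s, visited)]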

@0xtimmy (Contributor, Author) commented Apr 24, 2024

I futzed around with it and the traversal just seems to be too slow in general, so instead of reordering the reduceops when the kernel gets initialized, I render them in ast_parse, since ast_parse is recursing through the AST anyway.

I did have to make some changes to uops.py, so I'm adding some more tests to make sure things still linearize properly.

0xtimmy marked this pull request as ready for review April 25, 2024 04:43

@Qazalin (Collaborator) commented Apr 26, 2024

This diff is very hard to review; can you break it down into isolated PRs?

@Qazalin (Collaborator) commented May 1, 2024

@0xtimmy can you provide a status update?
What are the blockers to getting the linearizer changes merged?

@0xtimmy (Contributor, Author) commented May 1, 2024

It's finals week; I can probably have a PR up tomorrow.

@0xtimmy (Contributor, Author) commented May 3, 2024

In regards to @chenyuxyz's comment about using insert_before, I came up with two additional solutions (a rough sketch of the placeholder idea follows the list):

  1. we can "chunk" each reduceop: keep rendering them in ast_parse and just place the chunk of new uops before any existing loop context
  2. we can render all the reduceops out of order with some placeholder for each accumulator, then move them into the proper order and switch out the placeholder during the main ast_parse

Just some ideas I had; I can put up PRs once they're tested.
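
To make idea 2 concrete, here is a rough hypothetical sketch of the placeholder-accumulator approach in plain Python; FakeUOp, PLACEHOLDER, render_reduce_chunk, and splice are made-up names for illustration and are not tinygrad's UOp API:

from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class FakeUOp:          # stand-in for a rendered uop, illustration only
  op: str
  arg: object = None

PLACEHOLDER = FakeUOp("ACC_PLACEHOLDER")

def render_reduce_chunk(rid:int) -> List[FakeUOp]:
  # render one reduceop out of order, referring to its accumulator via the placeholder
  return [PLACEHOLDER, FakeUOp("LOAD", rid), FakeUOp("ALU", rid), FakeUOp("PHI", rid)]

def splice(chunks:Dict[int, List[FakeUOp]], order:List[int]) -> List[FakeUOp]:
  # move the chunks into the proper order and swap each placeholder for its real accumulator
  out: List[FakeUOp] = []
  for rid in order:
    acc = FakeUOp("DEFINE_ACC", rid)
    out += [acc if u == PLACEHOLDER else u for u in chunks[rid]]
  return out

chunks = {1: render_reduce_chunk(1), 0: render_reduce_chunk(0)}  # rendered out of order
print(splice(chunks, order=[0, 1]))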

@chaosagent (Contributor) left a comment

small things

reduceops = [x for x in self.lazyops if x.op in ReduceOps]
assert len(dedup(reduceops)) <= 1, "max one reduce op in an ast"
self.reduceop = reduceops[0] if reduceops else None
self.reduceops = list(set([x for x in self.lazyops if x.op in ReduceOps]))

Contributor

we have tinygrad.helpers.dedup
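
For context, tinygrad.helpers.dedup is (as I remember it) an order-preserving de-duplication, which avoids the nondeterministic ordering that list(set(...)) can introduce; a quick sketch of the suggestion:

from tinygrad.helpers import dedup

assert dedup([3, 1, 3, 2, 1]) == [3, 1, 2]  # keeps first-seen order, unlike set()
# so the diff above could become:
# self.reduceops = dedup([x for x in self.lazyops if x.op in ReduceOps])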

ret.reduceop, ret.outbufs, ret.vars, ret.bufs, ret.earlybufs, ret.full_buf_index = \
self.reduceop, self.outbufs, self.vars, [x for x in self.bufs if not isinstance(x, LocalBuffer)], self.earlybufs, self.full_buf_index
ret.reduceops, ret.outbufs, ret.vars, ret.bufs, ret.earlybufs, ret.full_buf_index = \
self.reduceops, self.outbufs, self.vars, [x for x in self.bufs if not isinstance(x, LocalBuffer)], self.earlybufs, self.full_buf_index

Contributor

ret.reduceops = self.reduceops[:]

Contributor

nevermind, reduceops is not mutated

@0xtimmy (Contributor, Author) commented May 20, 2024

> uops.py got a rewrite fyi - I think this diff requires some changes. The tests are almost there, can you create a separate PR for getting the tests upstreamed? I'd add tests for 2+ back-to-back reduces as well. We need a robust way for testing loop scopes.

I made a separate PR for the tests here: #4665

@Qazalin (Collaborator) commented May 21, 2024

wanna delete the skips too?

@@ -390,21 +401,20 @@ def linearize(self):

# parse AST
loaded_buffers:Dict[Union[MemBuffer, ConstBuffer, LocalBuffer], List[UOp]] = {}
acc: List[UOp] = []
accs: Dict[LazyOp, List[UOp]] = {}

Collaborator

can you factor out this accs dict refactor to its own PR?

Contributor Author

yes sir

@0xtimmy (Contributor, Author) commented May 21, 2024

> wanna delete the skips too?

But this would mean that the assert len(reduceops) <= 1 has to go as well, or at least get moved.

@Qazalin (Collaborator) commented May 21, 2024

yeah you can remove it, the diff supports multiple reduces.

@0xtimmy (Contributor, Author) commented May 21, 2024

The failing tests seem to pass on the tinybox when run with GPU=1 and AMD=1, but I will look into it more.

@Qazalin (Collaborator) commented May 21, 2024

you can skip AMD CI for now - this is how we do it: https://github.com/tinygrad/tinygrad/blob/master/test/test_linearizer.py#L433
I'd look at GPU=1 though
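
For reference, the usual unittest pattern for this kind of device-conditional skip (a hedged sketch; the class name, test name, and message are placeholders, and the linked test may differ in detail) looks like:

import unittest
from tinygrad import Device

class TestMultireduceExample(unittest.TestCase):
  @unittest.skipIf(Device.DEFAULT == "AMD", "flaky on AMD CI for now")
  def test_two_back_to_back_reduces(self):
    pass  # placeholder body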

@0xtimmy (Contributor, Author) commented May 21, 2024

I messed around with GPU=1 a bit and I can't seem to get the issue to reproduce on the tinybox or my machine, but I do know how to fix them, re: 4c96836, which gives each reduceop its own local buffer.

The downside of this is that it might take up more scratch memory, but the upside is that we'll probably have to do it anyway for parallel reduction.

def test_upcast_multireduce_nested_local_upcast(self):
x, y, z, w = [Tensor.rand(1,128).realize() for _ in range(4)]
x, y, z, w = [Tensor.rand(1,128,128).realize() for _ in range(4)]

Collaborator

oh I think this should've been:

x, y = Tensor.rand(1,128), Tensor.rand(128, 128)
z, w = Tensor.rand(1,128), Tensor.rand(128, 128)

Contributor Author

yeah, this is what worked for me: #4682 (comment)

@@ -100,10 +100,11 @@ def global_load(self, i:int, idxs:List[Node], acc:Optional[LazyOp]=None, barrier
for idx, valid, rep_idx in zip(e_idxs, e_valids, iter_idxs(expand_vars)):
this_const, idx, valid = (invalid_value, NumNode(0), NumNode(1)) if valid.max == 0 else (const, idx, valid)
# todo: when multiple reduceops are supported, clearly disambiguate and test acc load keys are unique for each reduceop

Collaborator

can remove this todo

fake_reduce_idxs = [x*0 for x in reduce_idxs]

# add a local buffer for multistage reduce. # TODO: use local alias

@Qazalin (Collaborator) May 23, 2024

Contributor Author

Yeah, so for sequential reduceops you should be able to just reuse the same buffer, which is why for a while this block of code was in the linearize function.

However, it seems to misbehave on GPU: e.g. the CI for this commit versus the next (which actually makes each local buffer unique).

I'm honestly not sure why, because I can't get the bug to reproduce on my machine or the tinybox, but I could trace the issue to the local buffers, and giving each reduceop its own buffer fixed it.

I don't think it has anything to do with the earlybufs.

Collaborator

Contributor Author

So that it gets run for every reduceop.

We could keep it in 'linearize' and give it its own for loop, but it seemed more natural to put it in 'render_reduceop' so we can keep using '-1' to reference the local_buffer in stuff like 'self.sts' or 'global_store'.

@@ -426,10 +436,9 @@ def ast_parse(self, x:LazyOp, accs: Dict[LazyOp, List[UOp]], offs:Optional[List[
if x.op in BufferOps: return loaded_buffers[x.arg]
if x.op in [UnaryOps.CAST, UnaryOps.BITCAST]:
return [self.uops.add(UOps.BITCAST if x.op is UnaryOps.BITCAST else UOps.CAST,
self.get_base_dtype(x.arg), (u,)) for u in self.ast_parse(x.src[0], accs, offs, loaded_buffers)]
self.get_base_dtype(x.arg), (u,)) for u in self.ast_parse(x.src[0], accs, offs, loaded_buffers)]

Collaborator

is this diff extra?

Contributor Author

I think so, I can remove it

Contributor

Changes

Name                              Lines    Diff    Tokens/Line    Diff
------------------------------  -------  ------  -------------  ------
tinygrad/codegen/kernel.py          444      -1           17.3    +0.0
tinygrad/codegen/linearizer.py      354     +10           19.1    -0.2
tinygrad/codegen/uops.py            334      +3           17.3    +0.2


total lines changes: +12
