Multireduce-Kernels: Linearizer Changes and Tests #4259
base: master
Conversation
tinygrad/codegen/kernel.py
Outdated
-    reduceops = [x for x in self.lazyops if x.op in ReduceOps]
-    assert len(dedup(reduceops)) <= 1, "max one reduce op in an ast"
-    self.reduceop = reduceops[0] if reduceops else None
+    def get_rchildren(x:LazyOp): return ([x] if x.op in ReduceOps else []) + [r for s in x.src for r in get_rchildren(s)]
import time
from tinygrad.codegen.linearizer import Linearizer
from tinygrad.engine.schedule import create_schedule
from tinygrad.tensor import Tensor
st = time.perf_counter()
a = Tensor([1,2,3,4])
for _ in range(24): a = a + a  # each add uses the same source twice, so the AST is a deep DAG with shared nodes
r = a.sum()
ast = create_schedule([r.lazydata])[-1].ast
lin = Linearizer(*ast)
assert time.perf_counter()-st < 1.0  # a naive recursive walk revisits shared nodes and goes exponential here
you should keep track of visited nodes.
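A minimal sketch of that fix, assuming get_rchildren keeps its recursive shape and that LazyOp nodes are hashable (they are used as dict keys elsewhere in this PR): a visited set makes the walk linear in the number of unique nodes.

from typing import List, Optional, Set
from tinygrad.ops import LazyOp, ReduceOps

def get_rchildren(x:LazyOp, visited:Optional[Set[LazyOp]]=None) -> List[LazyOp]:
  if visited is None: visited = set()  # fresh set per top-level call, avoiding a mutable default
  if x in visited: return []           # shared subtree was already walked, skip it
  visited.add(x)
  return ([x] if x.op in ReduceOps else []) + [r for s in x.src for r in get_rchildren(s, visited)]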
I futzed around with it and the traversal seems to just be too slow in general, so instead of reordering them when the kernel gets initialized, I render them in ast_parse, because ast_parse is recursing through the ast anyways.
I did have to make some changes to uops.py, so I'm making some more tests to make sure things still linearize properly.
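Roughly the shape of that idea as a toy sketch (the Node class and string "uops" are stand-ins, not tinygrad's real signatures): the walk renders each reduce the first time it reaches one and memoizes it in the accs dict, which also means deeper reduces finish first.

from typing import Dict, List

class Node:
  def __init__(self, op, src=()): self.op, self.src = op, src

def ast_parse(x:Node, accs:Dict[Node, List[str]]) -> List[str]:
  if x.op == "REDUCE":
    if x not in accs:                     # render each reduceop exactly once
      for s in x.src: ast_parse(s, accs)  # sources first, so deeper reduces render first
      accs[x] = [f"acc{len(accs)}"]       # stand-in for the accumulator uops of a rendered reduce
    return accs[x]
  return [u for s in x.src for u in ast_parse(s, accs)]

inner = Node("REDUCE", (Node("LOAD"),))
outer = Node("REDUCE", (Node("ADD", (inner, Node("LOAD"))),))
accs: Dict[Node, List[str]] = {}
ast_parse(outer, accs)
assert list(accs) == [inner, outer]  # the deeper reduce was rendered first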
This diff is very hard to review, can you break it down into isolated PRs?
@0xtimmy can you provide a status update?
finals week, I can have a PR up probably tomorrow
in regards to @chenyuxyz's comment about using … just some ideas I had, I can put up PRs once they're tested
small things
tinygrad/codegen/kernel.py
Outdated
-    reduceops = [x for x in self.lazyops if x.op in ReduceOps]
-    assert len(dedup(reduceops)) <= 1, "max one reduce op in an ast"
-    self.reduceop = reduceops[0] if reduceops else None
+    self.reduceops = list(set([x for x in self.lazyops if x.op in ReduceOps]))
we have tinygrad.helpers.dedup
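For reference, dedup keeps first-seen order, which gives a deterministic reduceop order (unlike list(set(...))):

from tinygrad.helpers import dedup
assert dedup([3, 1, 3, 2, 1]) == [3, 1, 2]  # order-preserving dedup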
tinygrad/codegen/kernel.py
Outdated
-    ret.reduceop, ret.outbufs, ret.vars, ret.bufs, ret.earlybufs, ret.full_buf_index = \
-      self.reduceop, self.outbufs, self.vars, [x for x in self.bufs if not isinstance(x, LocalBuffer)], self.earlybufs, self.full_buf_index
+    ret.reduceops, ret.outbufs, ret.vars, ret.bufs, ret.earlybufs, ret.full_buf_index = \
+      self.reduceops, self.outbufs, self.vars, [x for x in self.bufs if not isinstance(x, LocalBuffer)], self.earlybufs, self.full_buf_index
ret.reduceops = self.reduceops[:]
nevermind, reduceops is not mutated
I made a separate PR for the tests here: #4665
wanna delete the skips too?
@@ -390,21 +401,20 @@ def linearize(self):
     # parse AST
     loaded_buffers:Dict[Union[MemBuffer, ConstBuffer, LocalBuffer], List[UOp]] = {}
-    acc: List[UOp] = []
+    accs: Dict[LazyOp, List[UOp]] = {}
can you factor out this accs dict refactor to its own PR?
yes sir
but this would mean that the …
yeah you can remove it, the diff supports multiple reduces.
the failing tests seem to pass on the tinybox when run with …
you can skip AMD CI for now - this is how we do it: https://github.com/tinygrad/tinygrad/blob/master/test/test_linearizer.py#L433
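Something along these lines (the test name and skip message here are made up; the exact condition in test_linearizer.py may differ):

import unittest
from tinygrad.device import Device
from tinygrad.helpers import CI

class TestMultiReduce(unittest.TestCase):  # hypothetical test class
  @unittest.skipIf(CI and Device.DEFAULT == "AMD", "flaky on AMD CI for now")
  def test_multireduce(self):
    ...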
I messed around with the … The downside of this is that it might take up more scratch memory, but the upside is that we'll probably have to do it anyways for parallel reduction.
test/test_linearizer.py
Outdated
   def test_upcast_multireduce_nested_local_upcast(self):
-    x, y, z, w = [Tensor.rand(1,128).realize() for _ in range(4)]
+    x, y, z, w = [Tensor.rand(1,128,128).realize() for _ in range(4)]
oh I think this should've been:
x, y = Tensor.rand(1,128), Tensor.rand(128, 128)
z, w = Tensor.rand(1,128), Tensor.rand(128, 128)
yeah, this is what worked for me: #4682 (comment)
tinygrad/codegen/linearizer.py
Outdated
@@ -100,10 +100,11 @@ def global_load(self, i:int, idxs:List[Node], acc:Optional[LazyOp]=None, barrier
     for idx, valid, rep_idx in zip(e_idxs, e_valids, iter_idxs(expand_vars)):
       this_const, idx, valid = (invalid_value, NumNode(0), NumNode(1)) if valid.max == 0 else (const, idx, valid)
+      # todo: when multiple reduceops are supported, clearly disambiguate and test acc load keys are unique for each reduceop
can remove this todo
     fake_reduce_idxs = [x*0 for x in reduce_idxs]
+    # add a local buffer for multistage reduce. # TODO: use local alias
I need to take more time with this part - Can you explain why this was moved from https://github.com/tinygrad/tinygrad/pull/4259/files#diff-63d57daaca59e5495bbb108da1c13e782f4a408db8503e015e8094ac07c87ea6L354
I assume it's related to https://github.com/tinygrad/tinygrad/pull/4259/files#diff-9ccda767dac5472fe89ca0b898bf591a473186f8fba88d470c19f6d4c219da5eR71?
yeah, so for sequential reduceops you should be able to just reuse the same buffer, which is why for a while this block of code was in the linearize function
however, it seems to misbehave on GPU: e.g. the CI for this commit versus the next (which actually makes each local buffer unique)
I'm honestly not sure why, because I can't get the bug to reproduce on my machine or the tinybox, but I could trace the issue to the local buffers, and giving each reduceop its own buffer fixed it
I don't think it has anything to do with the earlybufs
OK I see the temp name change - but why is it moved from https://github.com/tinygrad/tinygrad/pull/4259/files#diff-63d57daaca59e5495bbb108da1c13e782f4a408db8503e015e8094ac07c87ea6L354
so that it gets run for every reduceop
we could keep it in 'linearize' and give it its own for loop, but it seemed more natural to put it in 'render_reduceop' so we can keep using '-1' to reference the local_buffer in stuff like 'self.sts' or 'global_store'
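A toy illustration of that invariant (buffer names are stand-ins): appending each reduce's temp buffer inside render_reduceop means index -1 always points at the buffer for the reduce currently being rendered.

bufs = ["out", "in0", "in1"]     # stand-ins for the kernel's global buffers
for i in range(2):               # one pass per reduceop
  bufs.append(f"temp{i}")        # unique local buffer for this multistage reduce
  assert bufs[-1] == f"temp{i}"  # -1 (as in self.sts[-1] / global_store) hits the current temp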
tinygrad/codegen/linearizer.py
Outdated
@@ -426,10 +436,9 @@ def ast_parse(self, x:LazyOp, accs: Dict[LazyOp, List[UOp]], offs:Optional[List[
     if x.op in BufferOps: return loaded_buffers[x.arg]
     if x.op in [UnaryOps.CAST, UnaryOps.BITCAST]:
       return [self.uops.add(UOps.BITCAST if x.op is UnaryOps.BITCAST else UOps.CAST,
-        self.get_base_dtype(x.arg), (u,)) for u in self.ast_parse(x.src[0], accs, offs, loaded_buffers)]
+              self.get_base_dtype(x.arg), (u,)) for u in self.ast_parse(x.src[0], accs, offs, loaded_buffers)]
is this diff extra?
I think so, I can remove it
Changes

This is the linearizer stage of the multireduce kernels PR. It contains changes to linearizer.py, kernel.py, and uops.py to allow them to produce kernels with multiple reductions. Test cases have been added to test_linearizer.py that check a few things.

The main changes to linearizer.py are touching up spots that depend on there being only one reduceop, for example storing the result of a local reduce in a temp buffer so it can be loaded back into every thread. I also changed kernel.py so the "deeper" reduceops are rendered first. And I had to change uops.py so that if statements can be closed before the end of the kernel.
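For a concrete picture, this is the kind of expression whose AST contains two ReduceOps (the inner mean and the outer sum). Whether the scheduler fuses them into one kernel is a separate question; this PR makes the linearizer able to handle such an AST if it gets one.

from tinygrad.tensor import Tensor

x = Tensor.rand(128)
out = ((x - x.mean()) ** 2).sum()  # reduce, broadcast, then reduce again (variance-style)
print(out.numpy())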