Preorder reduces #4409
base: master
Conversation
oh, i overlooked this PR. this is pretty much what i had in mind, nice!
  if x.op in ReduceOps and not do_reduce:
-   assert offs is None, "not available if we aren't doing reduce"
-   return acc
+   return [reduced_ops[x][i] for i in offs] if offs else reduced_ops[x]
i think offs just removes unrolled shapes from iteration, so just returning reduced_ops[x] here is correct.
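(for intuition, a toy illustration of the two return paths; the values here are made up stand-ins, not real UOps:)
reduced_ops_x = ["u0", "u1", "u2", "u3"]   # one stand-in entry per accumulator slot
offs = [0, 2]                              # e.g. what acc_offsets() might yield for an upcast axis
[reduced_ops_x[i] for i in offs]           # -> ['u0', 'u2'], the slots for this unrolled iteration
reduced_ops_x                              # -> all slots, unchanged, when offs is None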
maybe I goofed something elsewhere and this just covers it up, but this was so that, if there's an intermediate calculation between two reduceops, it gets brought back to the unrolled shape
ex: an unrolled standard deviation looks like this without the offs:
__kernel void r_4_32_8n4(__global float* data0, const __global float* data1) {
int gidx0 = get_group_id(0); /* 4 */
int lidx1 = get_local_id(0); /* 32 */
float acc0 = 0.0f;
int alu0 = ((gidx0*256)+(lidx1*8));
float val0 = data1[alu0];
float val1 = data1[alu0+1];
float val2 = data1[alu0+2];
float val3 = data1[alu0+3];
float val4 = data1[alu0+4];
float val5 = data1[alu0+5];
float val6 = data1[alu0+6];
float val7 = data1[alu0+7];
float acc1 = 0.0f;
data0[(gidx0*32)+lidx1] = (0.015625f*((val0-((val7+val6+val5+val4+val3+val2+val1+val0+acc0)*0.015625f))+acc1));
}
when what we want is this:
__kernel void r_4_32_8n4(__global float* data0, const __global float* data1) {
int gidx0 = get_group_id(0); /* 4 */
int lidx1 = get_local_id(0); /* 32 */
float acc0 = 0.0f;
int alu0 = ((gidx0*256)+(lidx1*8));
float val0 = data1[alu0];
float val1 = data1[alu0+1];
float val2 = data1[alu0+2];
float val3 = data1[alu0+3];
float val4 = data1[alu0+4];
float val5 = data1[alu0+5];
float val6 = data1[alu0+6];
float val7 = data1[alu0+7];
float acc1 = 0.0f;
float alu1 = ((val7+val6+val5+val4+val3+val2+val1+val0+acc0)*0.015625f);
data0[(gidx0*32)+lidx1] = (0.015625f*((val7-alu1)+(val6-alu1)+(val5-alu1)+(val4-alu1)+(val3-alu1)+(val2-alu1)+(val1-alu1)+(val0-alu1)+acc1));
}
oh, i see. makes sense!
  stores = self.global_store(-1, [NumNode(0)]*len(self.sts[-1].shape), acc)
  endif = self.uops.add(UOps.ENDIF, None, (barrier,), cachable=False)
  barrier = self.uops.add(UOps.BARRIER, None, (endif,), cachable=False)
  acc = self.global_load(-1, [NumNode(0)]*len(self.sts[-1].shape), barrier=barrier)
load local reduction back into every thread
- if x.op is UnaryOps.CAST: return [self.uops.add(UOps.BITCAST if x.arg[1] else UOps.CAST, self.get_base_dtype(x.arg[0]), (u,), x.arg[0], \
-   insert_before=insert_before) for u in self.ast_parse(x.src[0], acc, offs, loaded_buffers, insert_before=insert_before)]
+ if x.op is UnaryOps.CAST: return [self.uops.add(UOps.BITCAST if x.arg[1] else UOps.CAST, self.get_base_dtype(x.arg[0]), (u,), x.arg[0]) \
+   for u in self.ast_parse(x.src[0], acc, offs, loaded_buffers, reduced_ops)]
  if x.op in ReduceOps and not do_reduce:
don't we want to do this read also if do_reduce and x is not the reduce we want to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure exactly what you mean, but it seemed like the only time do_reduce was set to true was when ast_parse was directly fed the reduceop:
linearizer.py:256 self.ast_parse(reduceop, acc, self.acc_offsets(self.full_buf_index), loaded_buffers, do_reduce=True, loop_ctx=loop_ctx)
linearizer.py:298 self.ast_parse(LazyOp(reduceop.op, (LazyOp(BufferOps.LOAD, (), self.bufs[-1]),)), acc, self.acc_offsets(-1), loaded_buffers, do_reduce=True, loop_ctx=loop_ctx) # noqa: E501
I agree, we do want to do that read, but in that case we'd have to pass the reduceop we want to do to ast_parse as well. I suppose that's not that hard: we just add another parameter, or we could make do_reduce an Optional[LazyOp] and pass it there.
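something like this, as a sketch (the reduce_for name is made up and the signature is approximated, not the actual diff):
from typing import Optional
from tinygrad.ops import LazyOp, ReduceOps  # assuming the usual imports

def ast_parse(self, x, acc, offs, loaded_buffers, reduced_ops,
              reduce_for: Optional[LazyOp] = None, loop_ctx=(), cache=None):
  # sketch: any ReduceOp that isn't the one we're executing right now
  # reads its already-computed value out of reduced_ops
  if x.op in ReduceOps and x is not reduce_for:
    return [reduced_ops[x][i] for i in offs] if offs else reduced_ops[x]
  ...  # otherwise fall through to the normal reduce handling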
it seemed like the only time do_reduce was set to true was when ast_parse was directly fed the reduceop
on the second-pass reduce, ast_parse will recurse down to the first-pass reduce's reduceop, right?
yeah, but do_reduce will get turned off when it recurses through:
values = [self.ast_parse(v, acc, offs, loaded_buffers, reduced_ops, loop_ctx=loop_ctx, cache=cache) for v in x.src]
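(the signature here is reconstructed from the call sites, not copied from the diff:)
def ast_parse(self, x, acc, offs, loaded_buffers, reduced_ops,
              do_reduce=False, loop_ctx=(), cache=None):
  # do_reduce defaults to False and isn't forwarded in the recursive
  # call above, so nested reduceops always take the not-do_reduce branch
  ...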
also re: #4419: reduced_ops is my way of accomplishing what I think you do with accs. I think replacing accs is prettier, esp if it helps with parallel reduceops.
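for reference, roughly how I picture reduced_ops (the types are assumed, not from the diff):
from typing import Dict, List

# map each already-linearized reduceop to the UOps holding its result,
# so later expressions substitute the value instead of re-walking the reduce
reduced_ops: Dict[LazyOp, List[UOp]] = {}
# populated after each first-pass reduce, something like:
#   reduced_ops[reduceop] = self.ast_parse(reduceop, acc, self.acc_offsets(...), loaded_buffers, do_reduce=True)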
@@ -349,7 +349,8 @@ def optimize_loops(self):
   # graph helper functions
   @functools.lru_cache(None)
   def get_recursive_parents(x:UOp, with_phi=False) -> Set[UOp]:
-    return set.union(set(x.vin), *[get_recursive_parents(p, with_phi) for p in x.vin], set(acc_scope[x]) if with_phi else set())
+    return set.union(set(x.vin), *[get_recursive_parents(p, with_phi) for p in x.vin if p.uop is not UOps.BARRIER], \
+      set(acc_scope[x]) if with_phi else set())
don't propagate get_recursive_parents through a barrier
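a self-contained toy of the cutoff (ToyUOp and its fields are simplified stand-ins, purely illustrative):
import functools
from dataclasses import dataclass
from typing import Set, Tuple

@dataclass(frozen=True)
class ToyUOp:
  uop: str                            # e.g. "LOAD", "BARRIER", "STORE"
  vin: Tuple["ToyUOp", ...] = ()

@functools.lru_cache(None)
def get_recursive_parents(x: ToyUOp) -> Set[ToyUOp]:
  # include every direct parent, but don't recurse through a BARRIER,
  # so uops behind the barrier stay out of the parent set
  return set.union(set(x.vin), *[get_recursive_parents(p) for p in x.vin if p.uop != "BARRIER"])

store = ToyUOp("STORE")
barrier = ToyUOp("BARRIER", (store,))
load = ToyUOp("LOAD", (barrier,))
assert get_recursive_parents(load) == {barrier}  # STORE stays hidden behind the barrier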
@Qazalin turns out preordering them was p easy
drafting this PR because it includes the linearizer changes (so I could test that it works for asts with multiple reduceops). I can break it down into PRs for the ordering code & linearizer changes if that would be easier