
#6041: CCL refactoring and sharded allgather optimization #8303

Merged · 2 commits merged into main on May 13, 2024

Conversation

@SeanNijjar (Contributor) commented May 9, 2024

Now that FD2 has dropped, and with it a lot of stability improvements, I'm going through my backlog of CCL changes and starting to merge them into main.

Unfortunately, due to the backlog, some of my changes ended up intertwined and weren't easily separable. For that reason, this PR needed to include some substantial refactoring in preparation for the reduce-scatter op. Those refactors are included alongside some sharded allgather optimizations and bug fixes.

At this point, the sharded allgather supports single-tile-high shards only. Multi-tile-high shard support is coming in a future change.

@tt-rkim - FYI adding myself as codeowner for new CCL directory
@tt-aho - just FYI and to get another set of eyes

@SeanNijjar SeanNijjar requested a review from tt-rkim as a code owner May 9, 2024 14:56
@SeanNijjar SeanNijjar requested a review from tt-aho May 9, 2024 14:56
Review threads: CODEOWNERS (outdated, resolved); tests/tt_eager/module.mk (resolved)
@SeanNijjar (Contributor, Author) commented:
I'm missing the changes needed to get build working with cmake. Adding.

@SeanNijjar (Contributor, Author) commented:
I had to include the fix for #6388 before I could safely merge this, because the underlying race condition also caused instability for sharded allgather. That fix is included in this PR.

Closes #6388

@tt-rkim (Collaborator) commented May 13, 2024

Now that I'm re-looking at this, should we take this as an opportunity to start a new gtest suite for eager, or leave it as is for now, where the run_tt_eager.py script just runs whatever it's given, regardless of whether the underlying test is a gtest?

@SeanNijjar (Contributor, Author) commented May 13, 2024

> Now that I'm re-looking at this, should we take this as an opportunity to start a new gtest suite for eager, or leave it as is for now, where the run_tt_eager.py script just runs whatever it's given, regardless of whether the underlying test is a gtest?

I'm likely going to add more gtests like these over time, so from an organizational perspective I think it would make sense to split it off; right now it's really just testing helpers. If it's alright with you, how about we figure out how to structure it in my next set of changes? I've got a new op (reduce_scatter) coming down the pipe - I can make the switch then.

@tt-rkim (Collaborator) commented May 13, 2024

Sounds good to me, approved

@tt-rkim (Collaborator) commented May 13, 2024

I also checked the CPP test run time diff between your branch and main for N300 FD - the difference doesn't seem notable.

Enable single-tile-high shard width allgather (non-block-based formats) support
…i-chip ops

This hang was caused by a race arising from reuse of a 16B memory region
for two levels of acks sent from the receiver to the sender side of the
EDM. The issue arises from the following sequence:

EDM Receiver Channel:
1) Receives Payload signal (fine)
2) Sends first level ack to sender
   - Updates `erisc_info->channels[i].receiver_ack` (this becomes a problem)
3) Issues eth send (`src_addr=&erisc_info->channels[i]`)
4) Waits for workers to complete
5) Sends second level ack to sender
   - Updates `erisc_info->channels[i].bytes_sent` to 0 (becomes a problem)
...
Ethernet Subsystem
1) Erisc updates L1 (step 2 above)
2) Gets and buffers first eth send command
3) Erisc updates L1 (step 5 above)
4) Gets and buffers second eth send command
5) Issues first eth send command
   - L1 is already updated to second level ack at this point
   - Sender gets full ack
6) Issues second eth send command
   - Sender gets second (duplicate) full ack
     - It associates that message with the future/next message

To resolve this issue, we force the first level ack to be performed
using a source address for eth_channel_sync_t that doesn't alias
`erisc_info->channels[i]`.

Also re-enable all-gather test configs
@SeanNijjar SeanNijjar merged commit 90955b9 into main May 13, 2024
5 checks passed