
#6041: CCL refactoring and sharded allgather optimization #8303

Merged · 2 commits merged into main on May 13, 2024

Conversation

@SeanNijjar (Contributor) commented May 9, 2024

Now that FD2 has dropped, and with it a lot of stability improvements, I'm going through my backlog of CCL changes and starting to merge them into main.

Unfortunately, due to the backlog, some of my changes ended up intertwined and weren't easily separable. For that reason, this PR needed to include some substantial refactoring in preparation for the reduce-scatter op. Those refactors are included alongside some sharded allgather optimizations and bug fixes.

At this point, the sharded allgather supports single-tile-high shards only. Multi-tile-high shard support is coming in a future change.

@tt-rkim - FYI adding myself as codeowner for new CCL directory
@tt-aho - just FYI and to get another set of eyes

@SeanNijjar SeanNijjar requested a review from tt-rkim as a code owner May 9, 2024 14:56
@SeanNijjar SeanNijjar requested a review from tt-aho May 9, 2024 14:56
Review threads: CODEOWNERS (outdated, resolved); tests/tt_eager/module.mk (resolved)
@SeanNijjar (Contributor, Author) commented:
I'm missing the changes needed to get build working with cmake. Adding.

@SeanNijjar (Contributor, Author) commented:
I had to include the fix for #6388 before I could safely merge this, because the underlying race condition also caused instability for sharded allgather. That fix is included in this PR.

Closes #6388

@tt-rkim (Collaborator) commented May 13, 2024

Now that I'm re-looking at this, should we take this as an opportunity to start a new gtest suite for eager, or leave it as is for now, where the run_tt_eager.py script just runs whatever it's given, regardless of whether the underlying test is a gtest?

@SeanNijjar (Contributor, Author) commented May 13, 2024

> Now that I'm re-looking at this, should we take this as an opportunity to start a new gtest suite for eager, or leave it as is for now, where the run_tt_eager.py script just runs whatever it's given, regardless of whether the underlying test is a gtest?

I'm likely going to add more gtests like these over time, so from an organizational perspective I think it would make sense to split it off; right now it's really just testing helpers. If it's alright with you, how about we figure out how to structure it in my next set of changes? I've got a new op (reduce_scatter) coming down the pipe - I can make the switch then.

@tt-rkim (Collaborator) commented May 13, 2024

Sounds good to me, approved

@tt-rkim (Collaborator) commented May 13, 2024

I also checked the CPP test run time diff between your branch and main for N300 FD - the difference doesn't seem notable.

Enable single-tile-high shard width allgather (non-block-based formats) support
…i-chip ops

This hang was caused by a race arising from reuse of a 16B memory region
for two levels of acks sent from the receiver to the sender side of the
EDM. The issue arises from the following sequence:

EDM Receiver Channel:
1) Receives Payload signal (fine)
2) Sends first level ack to sender
   - Updates `erisc_info->channels[i].receiver_ack` (this becomes a problem)
3) Issues eth send (`src_addr=&erisc_info->channels[i]`)
4) Waits for workers to complete
5) Sends second level ack to sender
   - Updates `erisc_info->channels[i].bytes_sent` to 0 (becomes a problem)
...
Ethernet Subsystem
1) Erisc updates L1 (step 2 above)
2) Gets and buffers first eth send command
3) Erisc updates L1 (step 5 above)
4) Gets and buffers second eth send command
5) Issues first eth send command
   - L1 is already updated to second level ack at this point
   - Sender gets full ack
6) Issues second eth send command
   - Sender gets second (duplicate) full ack
     - It associates that message with the future/next message

To resolve this issue, we force the first level ack to be performed
using a source address for eth_channel_sync_t that doesn't alias
`erisc_info->channels[i]`.

Also re-enable all-gather test configs
@SeanNijjar SeanNijjar merged commit 90955b9 into main May 13, 2024
5 checks passed