-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#6041: CCL refactoring and sharded allgather optimization #8303
Conversation
5820cc1
to
b4cf99a
Compare
Workflows: |
I'm missing the changes needed to get build working with cmake. Adding. |
Now that I'm re-looking at this, should we take this as an opportunity to start a new gtest suite for eager, or just leave it as is for now, where the |
I'm likely going to add more gtests like these over time so from an organizational perspective, I think it would make sense to split it off. It's really just testing helpers. If it's alright with you, how about we figure out how to structure it in my next set of changes? I've got some new op coming down the pipe (reduce_scatter) - I can make the switch then? |
Sounds good to me, approved |
I also checked the CPP test run time diff between your branch and main for N300 FD - doesn't seem to be notable |
ce258e3
to
e5835ec
Compare
Enable single-tile-high shard width allgather (non-block-based formats) support
…i-chip ops This hang was caused by a race caused by reuse of a 16B memory region for two levels of ack sent from receiver to sender side of EDM. The issue arises because of the following sequence: EDM Receiver Channel: 1) Receives Payload signal (fine) 2) Sends first level ack to sender - Updates `erisc_info->channels[i].receiver_ack` (this becomes a problem) 3) Issues eth send (`src_addr=&erisc_info->channels[i]`) 4) Waits for workers to complet 5) Sends second level ack to sender - Updates `erisc_info->channels[i].bytes_sent` to 0 (becomes a problem) ... Ethernet Subsystem 1) Erisc updates L1 (step 2 above) 2) Gets and buffers first eth send command 3) Erisc updates L1 (step 5 above) 4) Gets and buffers second eth send command 5) Issues first eth send command - L1 is already updated to second level ack at this point - Sender gets full ack 6) Issues second eth send command - Sender gets second (duplicate) full ack - It associates that message with the future/next message To resolve this issue, we force the first level ack to be performed using a source address for eth_channel_sync_t that doesn't alias `erisc_info->channels[i]` Also re-enable all-gather test configs
e5835ec
to
ebf5230
Compare
Now that FD2 has dropped, and with it a lot of stability improvements, I'm going through my backlog of CCL changes and starting to merge them into main.
Unfortunately, due to the backlog, some of my changes ended up being intertwined and weren't easily separable. For that reason, reason this PR needed to include some substantial refactor work in preparation for the reduce scatter op. Those refactors are included alongside some sharded allgather optimizations and bug fixes.
At this point, the sharded allgather supports single-tile-high shards only. Multi-tile-high sharded support are coming in a future change.
@tt-rkim - FYI adding myself as codeowner for new CCL directory
@tt-aho - just FYI and to get another set of eyes