
Switch to parallel FFD bin packing algorithm. #1619

Merged: winglian merged 7 commits into main from ds-packing-v2 on May 23, 2024
Conversation

@winglian (Collaborator) commented May 15, 2024

Add support for packing in a distributed context.
Add packing efficiency estimate back.

See #1516 by @dsesclei. Attempting to rebase the original PR onto the latest main wasn't terribly clean. I also reverted the change to the distributed code.
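For readers unfamiliar with the algorithm: first-fit decreasing (FFD) sorts sequences longest-first, then drops each one into the first pack that still has room. Below is a minimal single-process sketch for illustration only; it is not axolotl's actual multipack sampler, and `pack_ffd` and its arguments are hypothetical names for this example.

```python
# Minimal first-fit-decreasing (FFD) sketch, for illustration only.
# Not axolotl's multipack sampler; names here are hypothetical.

def pack_ffd(lengths, bin_capacity):
    """Pack sequence lengths into bins holding `bin_capacity` tokens each.

    FFD sorts items longest-first, then places each one into the first
    bin with enough free space, opening a new bin when none fits.
    """
    free = []         # remaining capacity per bin
    assignments = []  # sequence indices per bin
    for i in sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True):
        for b in range(len(free)):
            if lengths[i] <= free[b]:
                free[b] -= lengths[i]
                assignments[b].append(i)
                break
        else:
            free.append(bin_capacity - lengths[i])
            assignments.append([i])
    return assignments

lengths = [3000, 2500, 1800, 1200, 900, 400]
packs = pack_ffd(lengths, bin_capacity=4096)
# packing efficiency = packed tokens / (bins * capacity)
print(packs, sum(lengths) / (len(packs) * 4096))
```

Presumably the parallel variant shards this work across processes; the efficiency ratio at the end is analogous to the sample_packing_eff_est reported in the logs below.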

@winglian (Collaborator, Author) commented:

@dsesclei Something doesn't seem quite right.

Here's the original estimate for the OpenHermes dataset:

[2024-05-23 19:14:18,019] [DEBUG] [axolotl.log:61] [PID:261367] [RANK:0] total_num_tokens: 370_825_938
[2024-05-23 19:14:22,669] [DEBUG] [axolotl.log:61] [PID:261367] [RANK:0] total_supervised_tokens: 198_133_103
[2024-05-23 19:14:23,257] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:261367] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 370825938
[2024-05-23 19:14:23,258] [DEBUG] [axolotl.log:61] [PID:261367] [RANK:0] data_loader_len: 11203
[2024-05-23 19:14:23,258] [INFO] [axolotl.log:61] [PID:261367] [RANK:0] sample_packing_eff_est across ranks: [0.9637393684216654]
[2024-05-23 19:14:23,258] [DEBUG] [axolotl.log:61] [PID:261367] [RANK:0] sample_packing_eff_est: 0.97

And with the new algorithm it's about half the number of steps, even though packing efficiency only improves from 0.97 to 0.999:

[2024-05-23 19:04:34,135] [DEBUG] [axolotl.log:61] [PID:261098] [RANK:0] total_num_tokens: 370_825_938
[2024-05-23 19:04:38,736] [DEBUG] [axolotl.log:61] [PID:261098] [RANK:0] total_supervised_tokens: 198_133_103
[2024-05-23 19:04:44,609] [DEBUG] [axolotl.log:61] [PID:261098] [RANK:0] data_loader_len: 5663
[2024-05-23 19:04:44,609] [INFO] [axolotl.log:61] [PID:261098] [RANK:0] sample_packing_eff_est across ranks: [0.9990915099930614]
[2024-05-23 19:04:44,609] [DEBUG] [axolotl.log:61] [PID:261098] [RANK:0] sample_packing_eff_est: 1.0

With a context of 4k, micro batch size of 2, and gradient accumulation steps of 4, I'd expect the following lengths, but the new data_loader_len (5663) is off by exactly a factor of 2:
old: 370825938 / 4096 ctx / 0.963 eff / 4 gas / 2 mbsz = 11751
new: 370825938 / 4096 ctx / 0.999 eff / 4 gas / 2 mbsz = 11328
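For reference, those expected lengths follow from straight division of the logged totals; this is a plain arithmetic sanity check, nothing axolotl-specific:

```python
# Reproducing the expected step counts from the logged numbers above.
total_tokens = 370_825_938
ctx, gas, mbsz = 4096, 4, 2

for label, eff in [("old", 0.963), ("new", 0.999)]:
    print(label, int(total_tokens / ctx / eff / gas / mbsz))
# -> old 11751, new 11328; the logged new data_loader_len of 5663
#    is almost exactly half of 11328.
```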

@winglian (Collaborator, Author) commented:

Alright, I corrected the dataset length calculation and it all seems sane now.

winglian merged commit 367b2e8 into main on May 23, 2024
7 checks passed
winglian deleted the ds-packing-v2 branch on May 23, 2024 at 21:32