
Switch to parallel FFD bin packing algorithm (closes #1492) #1516

Closed. dsesclei wants to merge 3 commits from the ds-packing branch.

Conversation

@dsesclei (Contributor) commented Apr 11, 2024

Description

Replace the existing sample packing algorithm with a parallel implementation of first-fit-decreasing.

Motivation and Context

I noticed recently that we could get denser sample packing with a different algorithm. Looking into it more, FFD performs just as well and is much faster than the heuristic I had 😅.

We can run FFD in parallel without losing much packing efficiency by packing samples in groups rather than all at once. On an i9-14900K, it takes 2.2s to pack 1M samples at 99.7% efficiency (the current multipack.py reaches 91.7% in 0.32s).

I removed the length estimates around packing in favor of just counting the batches, but let me know if I should add that back in. Two new config options are added: sample_packing_group_size controls the number of samples packed by each process, and sample_packing_bin_size sets the number of samples that can be placed in one pack (may need to be increased for large context lengths).
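
For anyone unfamiliar with FFD, here is a rough sketch of the idea (illustrative only; pack_group, bin_capacity, and the driver code in the trailing comment are made-up names for this sketch, not the actual code in multipack.py):

```python
# First-fit-decreasing over one group of sample lengths (illustrative sketch).
def pack_group(lengths, bin_capacity, bin_size):
    """Pack `lengths` into bins of at most `bin_capacity` tokens and at most
    `bin_size` samples each, returning lists of sample indices."""
    # "Decreasing": consider the longest samples first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []       # each bin is a list of sample indices
    remaining = []  # free token budget left in each bin

    for idx in order:
        length = lengths[idx]
        # "First fit": place the sample in the first bin where it fits.
        for b, free in enumerate(remaining):
            if length <= free and len(bins[b]) < bin_size:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing bin can take it; open a new bin.
            bins.append([idx])
            remaining.append(bin_capacity - length)
    return bins

# The parallelism comes from splitting the dataset into groups of
# sample_packing_group_size samples and packing each group independently,
# e.g. (hypothetical driver code):
#
#   from multiprocessing import Pool
#   groups = [lengths[i:i + group_size] for i in range(0, len(lengths), group_size)]
#   with Pool() as pool:
#       packs = pool.starmap(pack_group, [(g, seq_len, bin_size) for g in groups])
```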

How has this been tested?

Tests have been updated to verify that packing is correct. Training appears to run the same, just with fewer steps.

It seems plausible that the sorting step in FFD would interfere with shuffling between epochs, but I haven't been able to find any evidence of that being the case. Testing against a few similarity metrics shows that even when all the packing is done at once in a single group, shuffling still generates a mostly new set of packs.
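
As a rough illustration of that kind of check (a sketch with made-up names, not the exact metrics used above), each pack can be treated as a set of sample indices and the pack sets from two shuffles compared directly:

```python
import random

def pack_overlap(lengths, pack_fn, seed_a, seed_b):
    """Jaccard similarity between the pack sets produced by two shuffles.

    `pack_fn` is any packer that maps a list of lengths to a list of bins of
    positions (for example the FFD sketch above); this helper is hypothetical,
    not code from the PR.
    """
    def packs_for(seed):
        order = list(range(len(lengths)))
        random.Random(seed).shuffle(order)
        bins = pack_fn([lengths[i] for i in order])
        # Map positions in the shuffled list back to original sample indices.
        return {frozenset(order[pos] for pos in b) for b in bins}

    a, b = packs_for(seed_a), packs_for(seed_b)
    return len(a & b) / max(len(a | b), 1)  # 1.0 = identical packs, 0.0 = all new
```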

Screenshots

Some performance checks below for 1M items.

[Figure: group_size_vs_excess]
[Figure: bin_size_vs_excess]

dsesclei force-pushed the ds-packing branch 2 times, most recently from e77a87a to 739dd5f on April 11, 2024 04:00
@winglian (Collaborator)

> I removed the length estimates around packing in favor of just counting the batches, but let me know if I should add that back in.

I need to do some checking, but the estimates exist because different processes get different splits of the data, so the actual count of packed samples can vary from process to process. When this happens, one process thinks it needs to run another step while another thinks it's done, and they get out of sync. The estimate was the sanest way I could come up with to have each process compute a deterministic length. I'm open to other ideas for working around this.
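
Roughly, the estimate works off quantities every rank already agrees on; a sketch of that idea (the names and the efficiency-based formula here are illustrative, not axolotl's exact computation):

```python
import math

def estimated_batches(total_tokens, seq_len, micro_batch_size, world_size,
                      packing_efficiency=0.95):
    # If each pack is roughly packing_efficiency full, the pack count is about
    # total_tokens / (seq_len * packing_efficiency). All of these inputs are
    # global, so every rank computes the same number of steps. Whether this
    # over- or under-estimates depends on how conservative packing_efficiency
    # is; the point is only that the result is identical across ranks.
    packs = math.floor(total_tokens / (seq_len * packing_efficiency))
    return packs // (micro_batch_size * world_size)
```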

dsesclei force-pushed the ds-packing branch 2 times, most recently from 67f1504 to 8c233a0 on April 11, 2024 18:38
@dsesclei (Contributor, Author)

Could we generate all the packs and then split them evenly across ranks (like in the updated multipack.py)? Each rank would then get exactly the same number of batches and stay in sync.
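
Something along these lines (just a sketch; split_packs and the round-robin slicing are illustrative):

```python
def split_packs(all_packs, world_size, rank):
    """Give each rank a disjoint, equally sized slice of the full pack list."""
    # Drop the remainder so the count divides evenly; `usable` is identical on
    # every rank, so every rank runs the same number of steps.
    usable = len(all_packs) - (len(all_packs) % world_size)
    return all_packs[rank:usable:world_size]
```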

@winglian (Collaborator)

> Could we generate all the packs and then split them evenly across ranks (like in the updated multipack.py)? Each rank would then get exactly the same number of batches and stay in sync.

Perhaps we could do something like dispatch_batches=True so that the packing only runs on rank 0. I'm not 100% certain of the implications, though.
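
Untested sketch of what that might look like (the exact accelerate API depends on the version; newer releases move the flag into DataLoaderConfiguration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# With dispatch_batches=True the prepared dataloader is only iterated on the
# main process and each batch is sliced and sent to the other ranks, so the
# packing would only need to run on rank 0.
accelerator = Accelerator(dispatch_batches=True)

# Dummy stand-in for the packed dataset; only the dataloader handling matters here.
packed = TensorDataset(torch.arange(32).reshape(8, 4))
loader = accelerator.prepare(DataLoader(packed, batch_size=2))

for batch in loader:
    pass  # training step would go here
```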

@NanoCode012 (Collaborator)

Hey, this is very interesting. Should there be some full run comparisons to make sure that there is no loss in performance?

@dsesclei (Contributor, Author)

> Perhaps we could do something like dispatch_batches=True so that the packing only runs on rank 0. I'm not 100% certain of the implications, though.

Gotcha, for now I'll keep this PR simple by leaving the packing estimates in. Ready for another look.

> Hey, this is very interesting. Should there be some full run comparisons to make sure that there is no loss in performance?

Yeah definitely, once the code is greenlit/finalized I'll rent an instance to test it in a distributed setup.


Review comment from a collaborator on the following code:

    return distributed_state.use_distributed and distributed_state.initialized
    global accelerate  # pylint: disable=global-statement
    if not accelerate:

Hey @dsesclei, sorry for the delay in getting back to this PR. Is there a particular reason Accelerator was added back rather than using PartialState? It's best not to explicitly load up Accelerator until the last possible moment, and I believe we can get everything about the distributed state from PartialState.
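
For reference, a minimal sketch of that suggestion (assuming PartialState exposes the same use_distributed and initialized properties used in the code under review):

```python
from accelerate import PartialState

distributed_state = PartialState()

def is_distributed():
    # Same check as the code under review, but without constructing an
    # Accelerator; PartialState only tracks the process/distributed state.
    return distributed_state.use_distributed and distributed_state.initialized
```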

@winglian (Collaborator)

Hey @dsesclei, we cherry-picked and merged your fixes in #1619. Thanks! We'd love to give you a shoutout if you're on Twitter or Discord and could share your handle.

winglian closed this on May 23, 2024
@dsesclei (Contributor, Author)

Thanks for getting this in, Wing! No handles to give, but I appreciate it.

dsesclei deleted the ds-packing branch on May 29, 2024 22:08
@winglian (Collaborator)

Thanks @dsesclei, I ended up having to revert the change because the loss was off by an order of magnitude. I'll need to dig into what the multipack sampler is outputting another time to see if there is something obvious it's doing differently.

@dsesclei (Contributor, Author)

Oh gotcha, I'll look into it.
