max length padding/truncating #55

Closed
Conversation

TimotheeMickus
Collaborator

Load imbalance is a very likely candidate for the scaling issues we have faced. This PR introduces a couple of new flags to enforce an equal load across nodes, which seems to result in reasonable communication performance.

This can be achieved with the following:

  • during training, enable truncation/padding with the --pad_to_max_length flag
  • set a --max_length to which the input tensors will be trimmed (so the sequence dimension of the input tensors is constant)
  • set --batch_type "sents" (so the batch dimension of the input tensors is constant), along with a low enough batch size.

Using a max length of 128 and a batch size of 300 sentences, we get 6k tok/s on a 4-GPU, single-node Europarl job, with a fairly constant GPU utilization rate (90 to 100%).
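
For illustration, here is a minimal sketch (not the PR's actual code; the helper names pad_or_truncate and collate_fixed are hypothetical) of why fixing both dimensions equalizes the load: with a fixed --max_length and a fixed sentence-level batch size, every process sees identically shaped batches at every step.

```python
import torch

def pad_or_truncate(ids, max_length, pad_id=0):
    """Trim a token-id list to max_length, or right-pad it with pad_id."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def collate_fixed(examples, max_length, pad_id=0):
    """Stack a fixed-size list of examples into a (batch, max_length) tensor."""
    rows = [pad_or_truncate(ex, max_length, pad_id) for ex in examples]
    return torch.tensor(rows, dtype=torch.long)

# With batch_type "sents" and a constant batch size, the per-step tensor shape
# (and hence the per-process load) is constant, e.g. (300, 128).
batch = collate_fixed([[5, 6, 7], list(range(200))], max_length=128)
assert batch.shape == (2, 128)
```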

To-dos before undrafting this PR:

  • replace max_length with the actual source and target sequence lengths, if not provided
  • ensure this works across multiple architectures
  • unit testing
  • maybe consider communicating the effective max sequence lengths across processes ahead of the forward loop, so as to limit the actual processing required
  • consider enabling load-balancing measures in the token-level batching functions (see the sketch after this list):
    • make numel_fn stateful, allowing it to track the max lengths across the batch items
    • forbid sampling from buckets with source/target lengths that would put the batch over the max load
    • break when the intended number of true tokens is reached / when there are no examples left to select that would be compatible with the max load
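
As a rough illustration of that last item, a stateful load estimator along the lines of the numel_fn idea might look like the sketch below. All names here are hypothetical and nothing in this snippet comes from the PR; it only shows how tracking running max lengths lets the batcher bound the padded cost before adding another example.

```python
class PaddedLoadEstimator:
    """Estimate the padded token count of a batch as it is being built."""

    def __init__(self):
        self.n_examples = 0
        self.max_src = 0
        self.max_tgt = 0

    def cost_if_added(self, src_len, tgt_len):
        """Padded load (src + tgt) if one more example were appended."""
        max_src = max(self.max_src, src_len)
        max_tgt = max(self.max_tgt, tgt_len)
        return (self.n_examples + 1) * (max_src + max_tgt)

    def add(self, src_len, tgt_len):
        self.n_examples += 1
        self.max_src = max(self.max_src, src_len)
        self.max_tgt = max(self.max_tgt, tgt_len)

# Usage: skip buckets whose lengths would push the batch over a fixed max load,
# keeping the padded size of every batch bounded.
est = PaddedLoadEstimator()
max_load = 300 * 256
for src_len, tgt_len in [(30, 35), (120, 110), (40, 45)]:
    if est.cost_if_added(src_len, tgt_len) <= max_load:
        est.add(src_len, tgt_len)
```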

Collaborator

@Waino Waino left a comment

LGTM. Let's merge this, although it is possibly obsoleted by better dynamic batching.

Collaborator

Waino commented May 13, 2024

See #67

@Waino Waino closed this May 13, 2024