Communicator hang fix in the actor loop #132

Open
wants to merge 6 commits into
base: main

terrykong
Collaborator

What does this PR do?

Fixes a hang at the end of the actor training loop: the clients created by HTTPCommunicator were never closed, so the process did not exit.

Changelog

  • Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

  • You can potentially add a usage example below

Here is a self-contained repro that demonstrates the problem, and the fix once this PR is applied:

```python
import os

from nemo_aligner.servers.http_communicator import HTTPCommunicator, close_all_communicators

# Server name mapped to the (host, port) of the critic server to contact.
server_dict = {
    'critic_infer': ('critic-8n5qm-worker-0.frameworks.svc.cluster.local', '5567')
}

communicator = HTTPCommunicator.create_http_communicator_from_dict(server_dict)
communicator.send_data_to_server('critic_infer', {'x': [1, 2, 3]})

# Close every registered communicator; without this the script hangs.
if os.environ.get('MANUAL_CLOSE', ''):
    close_all_communicators()
print('== end')
```

Saved as test.py, it behaves as follows:

```sh
python test.py # hangs
MANUAL_CLOSE=1 python test.py # exits normally
```
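To make the failure mode concrete, here is a generic sketch of how a client that is never closed can keep a Python process alive. It is not pytriton's or NeMo-Aligner's actual code: `WorkerClient` is a hypothetical stand-in, and the assumption that the real client owns a non-daemon background thread is mine, not something this PR states.

```python
import queue
import threading


class WorkerClient:
    """Hypothetical client that owns a background worker thread."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self):
        self._requests = queue.Queue()
        # Non-daemon thread: the interpreter waits for it at shutdown,
        # so a worker that is never told to stop blocks process exit.
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def _run(self):
        while True:
            item = self._requests.get()  # blocks until something arrives
            if item is self._STOP:
                return

    def close(self):
        # Wake the worker with the sentinel and wait for it to finish.
        self._requests.put(self._STOP)
        self._worker.join()

    @property
    def closed(self):
        return not self._worker.is_alive()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


# Context-manager form: close() runs even if the body raises.
with WorkerClient() as client:
    pass
```

Creating `WorkerClient()` without `with` and without calling `close()` would leave the worker blocked on the queue, which is the same shape of problem as the unclosed communicator in the repro.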

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

@terrykong
Collaborator Author

TODO: I need to rebase on main, but I've left the branch like this since it's rooted on a previous NGC release commit

@terrykong
Collaborator Author

I have confirmed in my environment that the communicators get cleaned up and the program exits without hanging:

PPO Global Step: 100%|██████████| 1/1 [2:21:38<00:00, 8498.04s/it, val_global_response_lengths_mean=1024.0, val_global_prompt_lengths=172, val_global_rewards=0.183, rollout_time=3.95e+3, train_time=61.5, validation_time=4.32e+3, train_global_response_lengths_mean=1024.0, train_global_prompt_lengths=153, train_global_rewards=-1.07, train_init_policy_kl=0, train_global_advantages_mean=0.00102, train_global_advantages_std=5.17, train_global_returns_mean=-.878, train_global_returns_std=2.95, train_global_values_mean=-.879, train_global_values_std=5.8, train_consumed_samples=512, train_epoch=1]
[NeMo I 2024-03-23 07:51:02 http_communicator:24] Cleaning up all registered communicators
[NeMo I 2024-03-23 07:51:02 http_communicator:26] Cleaning up communicator: server_name='critic_train' ip='critic-mnqt6-worker-0.frameworks.svc.cluster.local' port=5567
[NeMo I 2024-03-23 07:51:02 http_communicator:26] Cleaning up communicator: server_name='critic_infer' ip='critic-mnqt6-worker-0.frameworks.svc.cluster.local' port=5567
[NeMo I 2024-03-23 07:51:02 http_communicator:26] Cleaning up communicator: server_name='critic_save' ip='critic-mnqt6-worker-0.frameworks.svc.cluster.local' port=5567

@terrykong terrykong marked this pull request as ready for review March 25, 2024 17:31
gshennvm and others added 5 commits April 2, 2024 16:57
Signed-off-by: Gerald Shen <geshen@nvidia.com>
client

Fixes an issue where the actor loop creates HTTPCommunicators, which create
pytriton.client.FuturesModelClient instances without the usual `with`
context-manager syntax. This causes hangs because .close() is never called
on the client. This commit introduces a workaround where each
FuturesModelClient registers itself in a protected global dictionary and
the actor train loop manually calls close_all_communicators() to close
them all. Here is a self-contained repro that demonstrates the problem
as well as the solution:

```python
from nemo_aligner.servers.http_communicator import HTTPCommunicator, close_all_communicators

server_dict = {
    'critic_infer': ('critic-8n5qm-worker-0.frameworks.svc.cluster.local', '5567')
}

communicator = HTTPCommunicator.create_http_communicator_from_dict(server_dict)
communicator.send_data_to_server('critic_infer', {'x':[1,2,3]})

import os
if os.environ.get('MANUAL_CLOSE',''):
    close_all_communicators()
print('== end')
```

Where
```sh
python test.py # hangs
MANUAL_CLOSE=1 python test.py # exits normally
```
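The registry workaround described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `FakeClient`, `register`, `close_all`, `_registry`, and `_registry_lock` are hypothetical names, and "protected" is interpreted here as guarded by a lock.

```python
import threading

# Hypothetical module-level registry, guarded by a lock.
_registry = {}
_registry_lock = threading.Lock()


class FakeClient:
    """Hypothetical stand-in for a client that must be closed explicitly."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def register(name, client):
    """Record a client so it can be closed later."""
    with _registry_lock:
        _registry[name] = client


def close_all():
    """Close every registered client and return the names that were closed."""
    # Take a snapshot and empty the registry under the lock, then close
    # outside it so a slow close() does not block new registrations.
    with _registry_lock:
        clients = list(_registry.items())
        _registry.clear()
    closed_names = []
    for name, client in clients:
        client.close()
        closed_names.append(name)
    return closed_names


def train_loop():
    """Hypothetical loop that creates clients as a side effect."""
    register("critic_infer", FakeClient())
    register("critic_train", FakeClient())


try:
    train_loop()
finally:
    # Runs whether or not the loop raised.
    close_all()
```

Wrapping the loop in `try`/`finally` is one possible design choice for making sure the cleanup runs on error paths too; the commit message only says that the actor train loop calls `close_all_communicators()` manually.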
This reverts commit 63af985.