Communicator hang fix in the actor loop #132

terrykong · 2024-03-22T16:22:40Z

What does this PR do ?

Fixes a hang observed at the end of the actor training loop

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

You can potentially add a usage example below

Here is a self-contained repro that demonstrates the problem
as well as the solution if you apply this PR:

```python
import os

from nemo_aligner.servers.http_communicator import HTTPCommunicator, close_all_communicators

# Map each server name to its (host, port) pair.
server_dict = {
    'critic_infer': ('critic-8n5qm-worker-0.frameworks.svc.cluster.local', '5567')
}

communicator = HTTPCommunicator.create_http_communicator_from_dict(server_dict)
communicator.send_data_to_server('critic_infer', {'x': [1, 2, 3]})

# Without this explicit cleanup the process hangs at exit.
if os.environ.get('MANUAL_CLOSE', ''):
    close_all_communicators()
print('== end')
```

Where

```sh
python test.py # hangs
MANUAL_CLOSE=1 python test.py # exits normally
```
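
The repro only closes the communicators on the happy path. One way a caller could make sure cleanup also runs when training raises is a `try`/`finally` block. The sketch below is an illustration only: `close_all_communicators_stub` and `run_training` are hypothetical stand-ins, and this is not necessarily how the PR wires the call into the actor loop.

```python
events = []


def close_all_communicators_stub():
    # Hypothetical stand-in for the real close_all_communicators().
    events.append("closed")


def run_training(fail=False):
    # Hypothetical training entry point; cleanup runs on success and on error.
    try:
        events.append("train")
        if fail:
            raise RuntimeError("simulated training failure")
    finally:
        close_all_communicators_stub()


run_training()
try:
    run_training(fail=True)
except RuntimeError:
    # The original exception still reaches the caller after cleanup.
    events.append("error propagated")
```

With this shape the cleanup call sits in one place, and the exception is not swallowed.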

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

Does the trainer resume and restore all model states?
Does the trainer support all parallelism techniques (PP, TP, DP)?
Does the trainer support max_steps=-1 and validation?
Does the trainer only call APIs defined in alignable_interface.py?
Does the trainer have proper logging?

Additional Information

Related to # (issue)

terrykong · 2024-03-22T16:23:31Z

TODO: I need to rebase on main, but I've left the branch like this since it's rooted on a previous NGC release commit

terrykong · 2024-03-25T17:31:41Z

I have confirmed in my environment that the communicators get cleaned up and the program exits without hanging:

PPO Global Step: 100%|██████████| 1/1 [2:21:38<00:00, 8498.04s/it, val_global_response_lengths_mean=1024.0, val_global_prompt_lengths=172, val_global_rewards=0.183, rollout_time=3.95e+3, train_time=61.5, validation_time=4.32e+3, train_global_response_lengths_mean=1024.0, train_global_prompt_lengths=153, train_global_rewards=-1.07, train_init_policy_kl=0, train_global_advantages_mean=0.00102, train_global_advantages_std=5.17, train_global_returns_mean=-.878, train_global_returns_std=2.95, train_global_values_mean=-.879, train_global_values_std=5.8, train_consumed_samples=512, train_epoch=1]
[NeMo I 2024-03-23 07:51:02 http_communicator:24] Cleaning up all registered communicators
[NeMo I 2024-03-23 07:51:02 http_communicator:26] Cleaning up communicator: server_name='critic_train' ip='critic-mnqt6-worker-0.frameworks.svc.cluster.local' port=5567
[NeMo I 2024-03-23 07:51:02 http_communicator:26] Cleaning up communicator: server_name='critic_infer' ip='critic-mnqt6-worker-0.frameworks.svc.cluster.local' port=5567
[NeMo I 2024-03-23 07:51:02 http_communicator:26] Cleaning up communicator: server_name='critic_save' ip='critic-mnqt6-worker-0.frameworks.svc.cluster.local' port=5567

Signed-off-by: Gerald Shen <geshen@nvidia.com>

Fixes an issue where the actor loop creates HTTPCommunicators, which create pytriton.client.FuturesModelClient instances without the typical `with` context manager syntax. This causes hangs because .close() is not called on the client.

This commit introduces a workaround where each FuturesModelClient registers itself in a protected global dictionary, and the actor train loop manually calls close_all_communicators() to close them all.

Here is a self-contained repro that demonstrates the problem as well as the solution:

```python
import os

from nemo_aligner.servers.http_communicator import HTTPCommunicator, close_all_communicators

# Map each server name to its (host, port) pair.
server_dict = {
    'critic_infer': ('critic-8n5qm-worker-0.frameworks.svc.cluster.local', '5567')
}

communicator = HTTPCommunicator.create_http_communicator_from_dict(server_dict)
communicator.send_data_to_server('critic_infer', {'x': [1, 2, 3]})

# Without this explicit cleanup the process hangs at exit.
if os.environ.get('MANUAL_CLOSE', ''):
    close_all_communicators()
print('== end')
```

Where

```sh
python test.py # hangs
MANUAL_CLOSE=1 python test.py # exits normally
```
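
For readers unfamiliar with the registry pattern described in this commit message, here is a minimal sketch of the idea using only the standard library. It is not the code from this PR: `DummyClient`, `_REGISTRY`, `register_client` and `close_all_clients` are hypothetical names, and `DummyClient` merely stands in for an object that, like `FuturesModelClient`, needs an explicit `close()`.

```python
import threading


class DummyClient:
    """Hypothetical stand-in for a client that must be closed explicitly."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


# Hypothetical module-level registry, guarded by a lock.
_REGISTRY = {}
_REGISTRY_LOCK = threading.Lock()


def register_client(client):
    """Remember a client so it can be closed later."""
    with _REGISTRY_LOCK:
        _REGISTRY[id(client)] = client
    return client


def close_all_clients():
    """Close every registered client, empty the registry, return the count."""
    with _REGISTRY_LOCK:
        clients = list(_REGISTRY.values())
        _REGISTRY.clear()
    # Close outside the lock so a slow close() does not block registration.
    for client in clients:
        client.close()
    return len(clients)


names = ("critic_train", "critic_infer", "critic_save")
clients = [register_client(DummyClient(name)) for name in names]
closed_count = close_all_clients()
```

Clearing the registry before closing makes a second call a no-op, so the cleanup function is safe to call more than once.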

github-actions bot added Utils Servers labels Mar 22, 2024

terrykong marked this pull request as ready for review March 25, 2024 17:31

gshennvm and others added 5 commits April 2, 2024 16:57

upgrade version

63af985

Signed-off-by: Gerald Shen <geshen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

08af808

for more information, see https://pre-commit.ci

missing import

8bcfe7c

[pre-commit.ci] auto fixes from pre-commit.com hooks

c23d786

for more information, see https://pre-commit.ci

terrykong force-pushed the communicator-hang-fix branch from c0b2e79 to c23d786 Compare April 2, 2024 23:57

github-actions bot removed the Utils label Apr 2, 2024

Revert "upgrade version"

86ade4f

This reverts commit 63af985.
