Communicator hang fix in the actor loop #132
Open
terrykong wants to merge 6 commits into NVIDIA:main from terrykong:communicator-hang-fix
Conversation
TODO: I need to rebase on main, but I've left the branch like this since it's rooted on a previous NGC release commit.

I have confirmed in my environment that the communicators get cleaned up and the program exits without hanging.
Signed-off-by: Gerald Shen <geshen@nvidia.com>
client: Fixes an issue where the actor loop creates HTTPCommunicators, which create pytriton.client.FuturesModelClient instances without the typical `with` contextmanager syntax. This causes hangs because `.close()` is never called on the client. This commit introduces a work-around in which each FuturesModelClient registers itself in a protected global dictionary, and the actor train loop manually calls `close_all_communicators()` to close them all.

Here is a self-contained repro that demonstrates the problem as well as the solution:

```python
from nemo_aligner.servers.http_communicator import HTTPCommunicator, close_all_communicators

server_dict = {
    'critic_infer': ('critic-8n5qm-worker-0.frameworks.svc.cluster.local', '5567')
}

communicator = HTTPCommunicator.create_http_communicator_from_dict(server_dict)
communicator.send_data_to_server('critic_infer', {'x': [1, 2, 3]})

import os
if os.environ.get('MANUAL_CLOSE', ''):
    close_all_communicators()
print('== end')
```

Where

```sh
python test.py                 # hangs
MANUAL_CLOSE=1 python test.py  # exits normally
```
for more information, see https://pre-commit.ci
terrykong force-pushed the communicator-hang-fix branch from c0b2e79 to c23d786 on April 2, 2024 at 23:57
This reverts commit 63af985.
What does this PR do?
Fixes a hang observed at the end of the actor training loop
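The general failure mode behind this kind of hang can be sketched as follows. This is an assumption for illustration only: the commit message says only that `.close()` is not called, so `TinyClient` and its worker thread are hypothetical, not pytriton's actual internals.

```python
import queue
import threading

class TinyClient:
    """Hypothetical client that owns a worker thread.

    Threads are non-daemon by default, so if close() is never
    called, the interpreter waits on the worker forever at exit --
    the kind of hang this PR addresses.
    """
    def __init__(self):
        self._q = queue.Queue()
        self._t = threading.Thread(target=self._run)  # non-daemon
        self._t.start()

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # sentinel: stop the worker
                return

    def close(self):
        self._q.put(None)
        self._t.join()

c = TinyClient()
c.close()  # comment this out and the program never exits
print('== end')
```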
Changelog
Usage
A self-contained repro that demonstrates the problem, as well as the solution if you apply this PR, is included in the commit message above.
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation

Additional Information