AM hangs when running tests #388

Open
hungj opened this issue Oct 3, 2019 · 0 comments
For example, the AM hangs in testTensorBoardPortSetOnlyOnChiefWorker. The log at the point of the hang:

2019-10-02 23:44:28 INFO  ApplicationMaster:886 - Received result registration request with exit code 0 from chief 0
2019-10-02 23:44:28 INFO  ApplicationMaster:893 - Unregistering task [chief:0] from Heartbeat monitor..
2019-10-02 23:44:31 INFO  ApplicationMaster:851 - All 3 tasks registered.
2019-10-02 23:44:31 INFO  ApplicationMaster:851 - All 3 tasks registered.
2019-10-02 23:44:31 INFO  ApplicationMaster:886 - Received result registration request with exit code 0 from worker 0
2019-10-02 23:44:31 INFO  ApplicationMaster:893 - Unregistering task [worker:0] from Heartbeat monitor..
2019-10-02 23:44:31 INFO  ApplicationMaster:886 - Received result registration request with exit code 0 from ps 0
2019-10-02 23:44:31 INFO  ApplicationMaster:893 - Unregistering task [ps:0] from Heartbeat monitor..
2019-10-02 23:44:59 INFO  Client:871 - Retrying connect to server: . Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-10-02 23:45:00 INFO  Client:871 - Retrying connect to server: . Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-10-02 23:45:01 INFO  Client:871 - Retrying connect to server: . Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-10-02 23:45:02 INFO  Client:871 - Retrying connect to server: . Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-10-02 23:45:03 INFO  Client:871 - Retrying connect to server: . Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

All three tasks unregistered with the AM, but then the ipc.Client retry policy kicks in, which causes the test to exceed its 10-minute timeout. The AM seems to be trying to talk to the RM, probably to call unregisterApplicationMaster, but cannot reach it for some reason.
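To pinpoint where a run like this stalls, it can help to scan the log for the largest gap between consecutive timestamps. The following is a minimal, self-contained sketch, not part of the project: the class name `LogGapFinder` and the method `largestGap` are hypothetical, and it assumes every relevant line begins with a `yyyy-MM-dd HH:mm:ss` timestamp, as in the excerpt above. The sample lines are abbreviated copies of the log entries in this issue.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

// Hypothetical helper for illustration only; not part of the project.
public class LogGapFinder {
    // Assumed timestamp layout, matching the log excerpt in this issue.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    /** Returns the largest gap between consecutive timestamped lines. */
    static Duration largestGap(List<String> lines) {
        Duration max = Duration.ZERO;
        LocalDateTime prev = null;
        for (String line : lines) {
            if (line.length() < 19) {
                continue; // too short to hold a timestamp
            }
            LocalDateTime ts;
            try {
                ts = LocalDateTime.parse(line.substring(0, 19), FMT);
            } catch (DateTimeParseException e) {
                continue; // skip lines that do not start with a timestamp
            }
            if (prev != null) {
                Duration gap = Duration.between(prev, ts);
                if (gap.compareTo(max) > 0) {
                    max = gap;
                }
            }
            prev = ts;
        }
        return max;
    }

    public static void main(String[] args) {
        // Abbreviated lines taken from the log in this issue.
        List<String> lines = List.of(
            "2019-10-02 23:44:28 INFO  ApplicationMaster:893 - Unregistering task [chief:0]",
            "2019-10-02 23:44:31 INFO  ApplicationMaster:893 - Unregistering task [ps:0]",
            "2019-10-02 23:44:59 INFO  Client:871 - Retrying connect to server",
            "2019-10-02 23:45:00 INFO  Client:871 - Retrying connect to server");
        System.out.println("Largest gap: " + largestGap(lines).getSeconds() + " s");
    }
}
```

On the sample lines, the largest gap is the 28 seconds between the last task unregistering (23:44:31) and the first connect retry (23:44:59), which is consistent with the AM blocking on a connection attempt before the retry policy starts logging.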
