Check for app failure before updating task infos #464

Open
hungj opened this issue Sep 14, 2020 · 0 comments
Labels
good first issue Good for newcomers

hungj (Contributor) commented Sep 14, 2020

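If the ApplicationMaster (AM) fails, TonyClient hangs while it retries the connection to the AM. In the log below, the getTaskInfos RPC issued from TonyClient.updateTaskInfos is retried under a policy of 50 attempts with a 1-second sleep (roughly 50 seconds) before the client gives up with "Connection refused". We should check whether the application has already failed before updating task infos, so the client fails faster.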

14-09-2020 15:15:07 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:07 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:08 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:08 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:09 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:09 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:10 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:10 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:11 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:11 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 INFO  Client:962 - Retrying connect to server: ltx1-hcl3578.grid.linkedin.com/10.150.55.156:31852. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 FATAL TonyClient:985 - Failed to run TonyClient
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - java.net.ConnectException: Call From ltx1-hcl6554.grid.linkedin.com/10.150.121.188 to ltx1-hcl3578.grid.linkedin.com:31852 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:754)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1547)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1489)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1388)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.sun.proxy.$Proxy20.getTaskInfos(Unknown Source)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:75)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at java.lang.reflect.Method.invoke(Method.java:498)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.sun.proxy.$Proxy21.getTaskInfos(Unknown Source)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:81)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.updateTaskInfos(TonyClient.java:895)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:851)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.run(TonyClient.java:185)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.start(TonyClient.java:983)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at com.linkedin.tony.TonyClient.main(TonyClient.java:1097)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - Caused by: java.net.ConnectException: Connection refused
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:701)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:808)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1604)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1435)
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 	... 21 more
14-09-2020 15:15:12 PDT mnist-avro-distributed INFO - 2020-09-14 22:15:12 ERROR TonyClient:992 - Application failed to complete successfully
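
One way the proposed check could look, as a minimal sketch rather than TonY's actual code: the monitoring loop asks for the application state first and skips the task-info RPC once the state is terminal. `FailFastMonitor`, `AppState`, `TaskInfoSource`, and `monitor` are hypothetical names introduced for illustration; in the real client the state would presumably come from the YARN application report (for example via `YarnClient.getApplicationReport`), which is an assumption about how this would be wired in.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

public class FailFastMonitor {
    // Hypothetical stand-in for the application states reported by YARN.
    enum AppState { RUNNING, FINISHED, FAILED, KILLED }

    // States after which the AM can no longer be expected to answer RPCs.
    static final Set<AppState> TERMINAL =
            EnumSet.of(AppState.FINISHED, AppState.FAILED, AppState.KILLED);

    // Hypothetical stand-in for the RPC that fetches task infos from the AM.
    interface TaskInfoSource {
        void updateTaskInfos();
    }

    /**
     * Polls at most maxPolls times. On each poll the application state is
     * checked first; the task-info RPC is only issued while the state is
     * non-terminal, so a dead AM never triggers the connection retry loop.
     * Returns the last state observed.
     */
    static AppState monitor(Supplier<AppState> stateSource, TaskInfoSource rpc, int maxPolls) {
        AppState state = AppState.RUNNING;
        for (int i = 0; i < maxPolls; i++) {
            state = stateSource.get();
            if (TERMINAL.contains(state)) {
                return state; // fail fast: do not call the AM at all
            }
            rpc.updateTaskInfos();
        }
        return state;
    }

    public static void main(String[] args) {
        // Simulated sequence of states: the app fails on the third poll.
        Iterator<AppState> states =
                Arrays.asList(AppState.RUNNING, AppState.RUNNING, AppState.FAILED).iterator();
        int[] rpcCalls = {0};
        AppState result = monitor(states::next, () -> rpcCalls[0]++, 10);
        System.out.println(result + " after " + rpcCalls[0] + " RPC calls");
    }
}
```

In this simulation the RPC is issued only on the two polls where the application is still running; the third poll sees the failed state and returns without contacting the AM. A real implementation would still need to handle the race where the AM dies between the state check and the RPC.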
@zuston zuston self-assigned this Mar 30, 2021
@zuston zuston added the good first issue Good for newcomers label Mar 30, 2021
@zuston zuston removed their assignment Mar 30, 2021