windows agent/node has difficulties connecting to scheduler server #3553

Open
ShatalovYaroslav opened this issue May 21, 2019 · 0 comments

General description:
When using the Windows agent or a Windows node, the agent very often fails to connect to the scheduler server. It reports a timeout during authentication; the call that times out is authentication.isActivated() in https://github.com/ow2-proactive/scheduling/blob/master/common/common-client/src/main/java/org/ow2/proactive/authentication/Connection.java.

After this timeout the agent restarts and retries, and most of the time it fails again. It can take up to 1 hour until the agent finally connects. Sometimes a restart of the server or agent machine is required.

Setup to reproduce:
The server (Linux or Windows) and the Windows Agent/Node run on different hosts.
The pnp protocol is used to connect; the agent is started on port 64738.
Firewall config: port 64738 is opened in the firewall of both the server and agent machines, for inbound and outbound rules.

Investigation done so far:
The issue was reproducible at first: it was enough to open port 64738 in the firewall and start proactive-node on the Windows machine. The server could not send its reply to the node even though the port was opened in the firewall. In this case the Windows nodes were not added to the RM and we observed a timeout error on the node side. With the firewall completely disabled, the issue does not occur.
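
To narrow down whether the firewall is actually blocking traffic in one direction, a plain TCP probe can help. The following is a minimal sketch, not part of ProActive: the class name PortProbe and the method isReachable are made up for illustration, and the 3 second timeout is an arbitrary choice. It only checks that a TCP connection can be opened to a host and port; it says nothing about the PNP protocol itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of ProActive: checks plain TCP reachability only.
final class PortProbe {

    // Returns true if a TCP connection to host:port can be opened within timeoutMs.
    // A refused connection, a timeout, or an unresolvable host all return false.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: PortProbe <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        // 3000 ms is an assumed timeout, chosen only for this example.
        System.out.println(host + ":" + port + " reachable: " + isReachable(host, port, 3000));
    }
}
```

Since the server log below shows the server failing to send a reply to the node address, it may be worth running the probe in both directions while a node is up: from the server host against the node (for example 192.168.1.74 64738) and from the node host against the server (192.168.1.119 64738).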

Conclusion after an investigation:
This issue is really hard to understand because it appears randomly. It is related to the firewall on the node's machine. I had the same firewall rules for the opened ports (64738, 64739, 64740) and it did not work for me for several days. Now, with the same rules, it works consistently across many Agent/Node restarts. It started to work for me after I opened port 1100 in the firewall and used this port for the Agent.
I can no longer reproduce the issue. It works with the ports opened in the firewall.

The error on the server side in debug mode:
error on server with firewall.txt

[2019-05-13 16:28:39,544 -thread-31 ERROR             	p.e.send_reply] pnp://192.168.1.119:64738/RM_NODE/ActiveObject_org.ow2.proactive.resourcemanager.authentication.RMAuthenticationImpl_31b81b69-16ab14a8c27--7ffe--b1dd522cc6337d10-31b81b69-16ab14a8c27--8000 : Failed to send reply to method:isActivated sequence: 1664060699 by pnp://192.168.1.74:64738/HalfbodiesNode_20305421/org.objectweb.proactive.core.body.UniversalBodyRemoteObjectAdapter@28a77031
java.io.IOException: remote object pnp://192.168.1.74:64738/HalfBody_pa.stub.org.ow2.proactive.resourcemanager.nodesource.dataspace._StubDataSpaceNodeConfigurationAgent%23configureNode_32925f64-16ab198468b--7ffe--e7afb9ebf75c0964-32925f64-16ab198468b--8000 not found. Message method=receiveReply, sender=null, sequenceNumber=0 cannot be processed
    at org.objectweb.proactive.extensions.pnp.PNPROMessageRequest.processMessage(PNPROMessageRequest.java:88)
    at org.objectweb.proactive.extensions.pnp.PNPServerHandler$RequestExecutor.run(PNPServerHandler.java:296)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

The error on the Windows node side:

[2019-05-10 16:27:10,239       main ERROR] [t] [NODE.Connection.waitAndConnect] org.objectweb.proactive.core.ProActiveTimeoutException: Timeout expired while waiting for the future update
[2019-05-10 16:27:10,240       main ERROR] [t] [NODE.RMNodeStarter.joinResourceManager] Unable to join the Resource Manager at pnp://192.168.1.119:64738
org.ow2.proactive.resourcemanager.exception.RMException: Cannot join the Resource Manager at pnp://192.168.1.119:64738/RMAUTHENTICATION due to Timeout expired while waiting for the future update
	at org.ow2.proactive.resourcemanager.frontend.RMConnection.waitAndJoin(RMConnection.java:100)
	at org.ow2.proactive.resourcemanager.utils.RMNodeStarter.joinResourceManager(RMNodeStarter.java:1173)
	at org.ow2.proactive.resourcemanager.utils.RMNodeStarter.registerInRM(RMNodeStarter.java:1210)
	at org.ow2.proactive.resourcemanager.utils.RMNodeStarter.connectToResourceManager(RMNodeStarter.java:541)
	at org.ow2.proactive.resourcemanager.utils.RMNodeStarter.createNodesAndConnect(RMNodeStarter.java:500)
	at org.ow2.proactive.resourcemanager.utils.RMNodeStarter.main(RMNodeStarter.java:445)
Caused by: org.objectweb.proactive.core.ProActiveTimeoutException: Timeout expired while waiting for the future update
	at org.objectweb.proactive.core.body.future.FutureProxy.waitFor(FutureProxy.java:336)
	at org.objectweb.proactive.api.PAFuture.waitFor(PAFuture.java:193)
	at org.ow2.proactive.authentication.Connection.waitAndConnect(Connection.java:171)
	at org.ow2.proactive.resourcemanager.frontend.RMConnection.waitAndJoin(RMConnection.java:98)
	... 5 more