Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ping repeatedly from ping thread #383

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Conversation

cpt1gl0
Copy link

@cpt1gl0 cpt1gl0 commented Jun 9, 2020

Ping repeatedly from ping thread before assuming error to avoid unnecessary node disconnects.

Copy link

@afalko afalko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement in principal!

I wonder if it is better to put the retry logic in the ping function; it already have a loop that checks on the future, maybe we should make the timeout 1 minute and retry 4 times to mimic current behavior?

src/main/java/hudson/remoting/PingThread.java Outdated Show resolved Hide resolved
src/main/java/hudson/remoting/PingThread.java Show resolved Hide resolved
@jeffret-b
Copy link
Contributor

Try a rebuild after infrastructure issues.

@jeffret-b jeffret-b closed this Jun 9, 2020
@jeffret-b jeffret-b reopened this Jun 9, 2020
@jeffret-b
Copy link
Contributor

I've wanted to look into making this ping thread more resilient. Changing it can be dangerous because there are some odd behaviors associated with it.

Thanks for the submission. I'll try to find some time to take a look at it.

@cpt1gl0
Copy link
Author

cpt1gl0 commented Jun 10, 2020

Re-implemented repeated ping, addressing the mentioned issues and trying to be less invasive:

  • Re-ping 4 times with timeout of 1 minute to mimic old behaviour
  • Keep old constructors
  • Moved retry logic to ping function
  • Use string format instead of joining strings

@jeffret-b
Copy link
Contributor

Looks like there's a compilation failure.

@jeffret-b
Copy link
Contributor

Let's try a CI rebuild to see where we currently are with this.

@jeffret-b jeffret-b closed this Oct 14, 2020
@jeffret-b jeffret-b reopened this Oct 14, 2020
@jeffret-b
Copy link
Contributor

Yes, there is a remaining compilation failure.

@cpt1gl0 , are you interested in getting back to finish this up?

@res0nance
Copy link
Contributor

@cpt1gl0 I hope you don't mind, I took the liberty of resolving the simple issues.

Copy link
Contributor

@jeffret-b jeffret-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, but the overall proposal is incomplete. Changes to the ping thread need a Jira ticket, even if they seem to maintain the existing behavior. It would be good to have more information on what this solves and how it will be used.

It may still make sense to accept this change, but it would help to have better background information.

src/main/java/hudson/remoting/PingThread.java Show resolved Hide resolved
src/main/java/hudson/remoting/PingThread.java Outdated Show resolved Hide resolved
@cpt1gl0
Copy link
Author

cpt1gl0 commented Nov 18, 2020

The code looks good, but the overall proposal is incomplete. Changes to the ping thread need a Jira ticket, even if they seem to maintain the existing behavior. It would be good to have more information on what this solves and how it will be used.

It may still make sense to accept this change, but it would help to have better background information.

The changes are intended to solve issues where a node is disconnected because of a single failing ping.

@jeffret-b
Copy link
Contributor

@cpt1gl0, thanks for getting back to this. Builds are passing now so we're in better shape.

Could you clarify a few things about your experience with this change, please?

Are you observing a problem that is fixed by this PR? In other words, are you experiencing an issue "where a node is disconnected because of a single failing ping"? If so, is this PR correcting the problem?

(It will help in accepting this PR if we have a confirmed fix and more testing, though we can still consider it as an improvement without that.)

Could you submit a Jira about the problem this addresses and link it to this PR?

@cpt1gl0
Copy link
Author

cpt1gl0 commented Nov 18, 2020

@jeffret-b I just can tell you that at my company, where we use Jenkins as an integral part of our build environment, we experience problems with disconnecting nodes when the build system is under heavy load. Usually those disconnections are caused by failing pings from ping thread. From that perspective, I think it would make sense to make the ping thread more fault tolerant, and allow for single failing pings without disconnect.

@jeffret-b
Copy link
Contributor

@cpt1gl0, have you been able to try out your improvement, perhaps in a test environment, to see that it improves the situation? If we can get some real-world success with this change I'm less concerned about the risk of introducing it.

@cpt1gl0
Copy link
Author

cpt1gl0 commented Nov 18, 2020

@jeffret-b I can not give you any hard facts. Personally, I think it's a good idea to introduce that change, but it is for you to decide.

@jeffret-b jeffret-b removed the stalled label Nov 19, 2020
@jeffret-b
Copy link
Contributor

I'll see about getting this in some future release. If anyone else would like to review or even better try it out that be very appreciated.

@oleg-nenashev oleg-nenashev requested a review from a team May 8, 2021 07:17
@jtnord
Copy link
Member

jtnord commented May 8, 2021

we experience problems with disconnecting nodes when the build system is under heavy load. Usually those disconnections are caused by failing pings from ping thread

Normally the ping thread is the symptom not the cause.
If you do sable the ping thread entirely (system property so easy to test) does that help you. If not the changes here are likely to be beneficial.

@olivergondza
Copy link
Member

I agree it is desirable to have it confirmed that a thread reporting a ping failure will actually recover and later operates fine.

@cpt1gl0
Copy link
Author

cpt1gl0 commented May 9, 2021

Normally the ping thread is the symptom not the cause.

A totally agree and I think this is not for discussion.

In our case usually a system under heavy load is the original cause, and yes, a failing ping thread is the symptom. For me it's not a question of cause and effect, but more about reasonable measures to take. Does it make sense to disconnect a node after a single ping failure? What's the gain? Why not be a bit more fault tolerant and leave the possibility for the system to recover without any failing builds? If pings keep failing the node would still be disconnected eventually and possible issues will not go unnoticed. Personally I would prefer a system which does not immediately chop off one of its legs because of a numb toe.

@daniel-beck
Copy link
Member

without any failing builds

Not the topic, but still: Consider using Pipeline.

@jeffret-b
Copy link
Contributor

@cpt1gl0 , if you have any experience using this change that results in better stability or results, that would help to move it forward. I agree that it would be better to be more resilient.

@felipecrs
Copy link

I don't know, but perhaps this could help https://issues.jenkins.io/browse/JENKINS-50730

@jeffret-b
Copy link
Contributor

I don't know, but perhaps this could help https://issues.jenkins.io/browse/JENKINS-50730

@felipecrs In what way does that help? Or how is that issue related to the ping thread?

@felipecrs
Copy link

@jeffret-b, I'm sorry, thinking about it now, I think it was a mistake from my side. Please disregard.

@NorseGaud
Copy link

The PR seems to be solving a problem we see as well.

@NorseGaud
Copy link

@cpt1gl0 Did you want me to open a jenkins ticket for this? Or could you?

@jglick
Copy link
Member

jglick commented Jan 20, 2023

Why not be a bit more fault tolerant and leave the possibility for the system to recover without any failing builds?

To expand on #383 (comment), a Pipeline build should recover automatically despite a node disconnection. There are three cases:

  1. The agent disconnects while running sh (or bat or powershell) but the agent machine itself is fine. The agent should reconnect automatically, and the sh step should proceed, including with any output printed during the outage, even if the forked process completed during that time.
  2. Same but during some other “non-durable” step such as checkout scm. That step will fail, but you can arrange for the build to recover automatically
    retry(conditions: [nonresumable()]) {
      node(…) {
        checkout scm
        // as before
      }
    }
  3. The agent machine actually crashed. In that case of course this PR would be irrelevant, but mentioning for completeness; you can still arrange for the build to recover automatically (even across controller restarts as of [JENKINS-49707] Agent missing after controller restart to fail resumption of node step, not kill whole build workflow-durable-task-step-plugin#180):
    retry(conditions: [agent(), nonresumable()]) {
      node(…) {
        // as before
      }
    }

Given these resilience modes, and the fact that you can already configure a longer ping timeout, I am not sure what is gained by retrying a failed ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
10 participants