Ping repeatedly from ping thread #383

cpt1gl0 · 2020-06-09T10:25:51Z

Ping repeatedly from ping thread before assuming error to avoid unnecessary node disconnects.

afalko

Nice improvement in principle!

I wonder if it would be better to put the retry logic in the ping function; it already has a loop that checks on the future. Maybe we should make the timeout 1 minute and retry 4 times to mimic the current behavior?

src/main/java/hudson/remoting/PingThread.java

jeffret-b · 2020-06-09T16:13:41Z

Try a rebuild after infrastructure issues.

jeffret-b · 2020-06-09T16:19:38Z

I've wanted to look into making this ping thread more resilient. Changing it can be dangerous because there are some odd behaviors associated with it.

Thanks for the submission. I'll try to find some time to take a look at it.

cpt1gl0 · 2020-06-10T09:00:36Z

Re-implemented repeated ping, addressing the mentioned issues and trying to be less invasive:

- Re-ping 4 times with a timeout of 1 minute to mimic the old behaviour
- Keep old constructors
- Moved retry logic to the ping function
- Use string format instead of joining strings
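For readers following along, the retry idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `PingRetrySketch`, the method `pingWithRetry`, and the `Supplier<Future<?>>` used to stand in for "send one ping" are all hypothetical names. It assumes that only a timed-out attempt is retried and that any other failure is passed straight to the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical sketch of a ping with bounded retries; not the actual PingThread code. */
public class PingRetrySketch {

    /**
     * Sends up to maxAttempts pings and waits at most timeoutMillis for each reply.
     *
     * @param sendPing stand-in for "send one ping and return a future for the reply" (assumption)
     * @return the 1-based number of the attempt that was answered
     * @throws TimeoutException if every attempt timed out
     */
    public static int pingWithRetry(Supplier<Future<?>> sendPing, int maxAttempts, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<?> reply = sendPing.get();
            try {
                reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return attempt; // the peer answered, so the connection is considered alive
            } catch (TimeoutException e) {
                reply.cancel(true); // abandon this attempt and try again, if attempts remain
            }
        }
        throw new TimeoutException(
                String.format("No ping reply after %d attempts of %d ms each", maxAttempts, timeoutMillis));
    }

    public static void main(String[] args) throws Exception {
        // 4 attempts of 1 minute each, the values proposed in this thread.
        int attempt = pingWithRetry(() -> CompletableFuture.completedFuture(null), 4, 60_000);
        System.out.println("Ping answered on attempt " + attempt);
    }
}
```

With these values the worst-case wait before giving up is 4 x 1 minute, the same total as one long timeout, but a single lost or slow ping no longer decides the outcome on its own.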

jeffret-b · 2020-07-02T17:14:09Z

Looks like there's a compilation failure.

jeffret-b · 2020-10-14T22:30:27Z

Let's try a CI rebuild to see where we currently are with this.

jeffret-b · 2020-10-14T22:41:25Z

Yes, there is a remaining compilation failure.

@cpt1gl0 , are you interested in getting back to finish this up?

res0nance · 2020-10-29T10:25:30Z

@cpt1gl0 I hope you don't mind, I took the liberty of resolving the simple issues.

jeffret-b

The code looks good, but the overall proposal is incomplete. Changes to the ping thread need a Jira ticket, even if they seem to maintain the existing behavior. It would be good to have more information on what this solves and how it will be used.

It may still make sense to accept this change, but it would help to have better background information.

src/main/java/hudson/remoting/PingThread.java

cpt1gl0 · 2020-11-18T07:19:44Z

> The code looks good, but the overall proposal is incomplete. Changes to the ping thread need a Jira ticket, even if they seem to maintain the existing behavior. It would be good to have more information on what this solves and how it will be used.
>
> It may still make sense to accept this change, but it would help to have better background information.

The changes are intended to solve issues where a node is disconnected because of a single failing ping.

jeffret-b · 2020-11-18T16:08:39Z

@cpt1gl0, thanks for getting back to this. Builds are passing now so we're in better shape.

Could you clarify a few things about your experience with this change, please?

Are you observing a problem that is fixed by this PR? In other words, are you experiencing an issue "where a node is disconnected because of a single failing ping"? If so, is this PR correcting the problem?

(It will help in accepting this PR if we have a confirmed fix and more testing, though we can still consider it as an improvement without that.)

Could you submit a Jira about the problem this addresses and link it to this PR?

cpt1gl0 · 2020-11-18T20:38:35Z

@jeffret-b I can only tell you that at my company, where we use Jenkins as an integral part of our build environment, we experience problems with disconnecting nodes when the build system is under heavy load. Usually those disconnections are caused by failing pings from the ping thread. From that perspective, I think it would make sense to make the ping thread more fault tolerant and allow for single failing pings without disconnecting.

jeffret-b · 2020-11-18T20:49:55Z

@cpt1gl0, have you been able to try out your improvement, perhaps in a test environment, to see that it improves the situation? If we can get some real-world success with this change I'm less concerned about the risk of introducing it.

cpt1gl0 · 2020-11-18T21:38:39Z

@jeffret-b I cannot give you any hard facts. Personally, I think it's a good idea to introduce that change, but it is for you to decide.

jeffret-b · 2020-11-19T16:39:55Z

I'll see about getting this in some future release. If anyone else would like to review or, even better, try it out, that would be very appreciated.

jtnord · 2021-05-08T09:28:04Z

> we experience problems with disconnecting nodes when the build system is under heavy load. Usually those disconnections are caused by failing pings from ping thread

Normally the ping thread is the symptom not the cause.
If you disable the ping thread entirely (it is a system property, so easy to test), does that help you? If not the changes here are likely to be beneficial.
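As a rough illustration of what "disable via system property" can look like in Java, here is a minimal sketch. Everything in it is an assumption made for illustration: the property name `example.ping.intervalSeconds`, the default of 300 seconds, and the convention that a non-positive interval turns pinging off. The real property names and semantics should be taken from the Jenkins remoting documentation.

```java
/** Hypothetical sketch of a ping interval controlled by a JVM system property. */
public class PingConfigSketch {

    // Hypothetical property name, for illustration only; not a real Jenkins property.
    public static final String INTERVAL_PROPERTY = "example.ping.intervalSeconds";

    // Assumed default, chosen only for this sketch.
    public static final int DEFAULT_INTERVAL_SECONDS = 300;

    /** Reads the interval from the system property, falling back to the default. */
    public static int pingIntervalSeconds() {
        return Integer.getInteger(INTERVAL_PROPERTY, DEFAULT_INTERVAL_SECONDS);
    }

    /** Assumed convention: a zero or negative interval disables pinging entirely. */
    public static boolean pingEnabled() {
        return pingIntervalSeconds() > 0;
    }

    public static void main(String[] args) {
        if (pingEnabled()) {
            System.out.println("Pinging every " + pingIntervalSeconds() + " seconds");
        } else {
            System.out.println("Ping thread disabled");
        }
    }
}
```

Under these assumptions, starting the JVM with `-Dexample.ping.intervalSeconds=-1` would switch pinging off for a test run without changing any code.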

olivergondza · 2021-05-09T18:37:08Z

I agree it is desirable to have it confirmed that a thread reporting a ping failure will actually recover and later operate fine.

cpt1gl0 · 2021-05-09T20:39:22Z

> Normally the ping thread is the symptom not the cause.

I totally agree, and I think this is not up for discussion.

In our case usually a system under heavy load is the original cause, and yes, a failing ping thread is the symptom. For me it's not a question of cause and effect, but more about reasonable measures to take. Does it make sense to disconnect a node after a single ping failure? What's the gain? Why not be a bit more fault tolerant and leave the possibility for the system to recover without any failing builds? If pings keep failing the node would still be disconnected eventually and possible issues will not go unnoticed. Personally I would prefer a system which does not immediately chop off one of its legs because of a numb toe.

daniel-beck · 2021-05-09T21:13:46Z

> without any failing builds

Not the topic, but still: Consider using Pipeline.

jeffret-b · 2021-05-09T23:08:08Z

@cpt1gl0 , if you have any experience using this change that results in better stability or results, that would help to move it forward. I agree that it would be better to be more resilient.

felipecrs · 2021-11-04T16:07:07Z

I don't know, but perhaps this could help https://issues.jenkins.io/browse/JENKINS-50730

jeffret-b · 2021-12-01T00:43:56Z

> I don't know, but perhaps this could help https://issues.jenkins.io/browse/JENKINS-50730

@felipecrs In what way does that help? Or how is that issue related to the ping thread?

felipecrs · 2021-12-01T00:45:33Z

@jeffret-b, I'm sorry, thinking about it now, I think it was a mistake from my side. Please disregard.

NorseGaud · 2022-06-03T16:12:17Z

The PR seems to be solving a problem we see as well.

NorseGaud · 2022-06-03T16:15:39Z

@cpt1gl0 Did you want me to open a Jenkins ticket for this? Or could you?

jglick · 2023-01-20T15:10:19Z

> Why not be a bit more fault tolerant and leave the possibility for the system to recover without any failing builds?

To expand on #383 (comment), a Pipeline build should recover automatically despite a node disconnection. There are three cases:

1. The agent disconnects while running sh (or bat or powershell) but the agent machine itself is fine. The agent should reconnect automatically, and the sh step should proceed, including with any output printed during the outage, even if the forked process completed during that time.
2. Same, but during some other “non-durable” step such as checkout scm. That step will fail, but you can arrange for the build to recover automatically:
   ```
   retry(conditions: [nonresumable()]) {
     node(…) {
       checkout scm
       // as before
     }
   }
   ```
3. The agent machine actually crashed. In that case of course this PR would be irrelevant, but mentioning for completeness; you can still arrange for the build to recover automatically (even across controller restarts as of workflow-durable-task-step-plugin#180, "[JENKINS-49707] Agent missing after controller restart to fail resumption of node step, not kill whole build"):
   ```
   retry(conditions: [agent(), nonresumable()]) {
     node(…) {
       // as before
     }
   }
   ```

Given these resilience modes, and the fact that you can already configure a longer ping timeout, I am not sure what is gained by retrying a failed ping.

cpt1gl0 added 3 commits June 9, 2020 12:22

ping repeatedly before reporting death from ping thread

6ad70f7

refactoring

4c4cdd3

more refactoring

62970fc

afalko suggested changes Jun 9, 2020

jeffret-b closed this Jun 9, 2020

jeffret-b reopened this Jun 9, 2020

cpt1gl0 added 4 commits June 10, 2020 09:24

Undo previous changes.

6a0903b

re-implemented ping retry with minimum invasiveness

cab22a2

fixing typo

2a7c8e4

use int for max timeouts

c4d58da

jeffret-b closed this Oct 14, 2020

jeffret-b reopened this Oct 14, 2020

jeffret-b added the stalled label Oct 21, 2020

Resolve compilation errors

3c14b6d

jeffret-b reviewed Oct 29, 2020

refactoring ping loop

96f1841

formatting

063c8f6

jeffret-b removed the stalled label Nov 19, 2020

oleg-nenashev requested a review from a team May 8, 2021 07:17
