add timeout for executor signal #13011
Comments
@agilgur5 Could you help me look at this issue? Thanks!
Could you explain or give examples of when this is a problem, please?
The logic of the first piece of code is to send a kubectl exec command using the SPDY protocol and get its return value. It blocks here waiting for that to return. In some unusual cases, the goroutine gets stuck here forever. If all the pod cleanup workers are blocked here, something very serious will happen.
@Joibel You can read what I just commented on. Thank you!
@Joibel This is already happening in our internal test environment. All pod cleanup workers are blocked, preventing pods from being cleaned up.
I can see that these might block, and that your fix fixes this.
@Joibel Thanks for your review. This executes a kill -15 command; the kill command produces very little log output and does not itself block. I suspect the problem comes from the connection passing through multiple proxies, but I can't find out why, because the pod has actually already been cleaned up. So I think we need to prevent this from happening; adding a timeout is a good preventive improvement for this kind of situation. What do you think? First, maybe no one here has noticed this problem before, but it was discovered by users in the Kubernetes community, and the ability to cancel the call was added there; for details see kubernetes/kubernetes#103177. Second, I think this is a good optimization point, and we should make some preventive improvements in this code.
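A minimal sketch of what that cancellation support makes possible, assuming an executor built with remotecommand.NewSPDYExecutor for the pod's exec subresource; the helper name and timeout handling below are illustrative, not the actual Argo code:

```go
package executil

import (
	"bytes"
	"context"
	"time"

	"k8s.io/client-go/tools/remotecommand"
)

// execWithTimeout bounds how long the exec call may block waiting on the
// remote stdout/stderr streams. The executor is assumed to have been built
// with remotecommand.NewSPDYExecutor for the pod's "exec" subresource URL.
func execWithTimeout(exec remotecommand.Executor, timeout time.Duration) (string, string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	// Once the deadline expires, StreamWithContext returns a context error
	// instead of leaving the goroutine stuck forever.
	return stdout.String(), stderr.String(), err
}
```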
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
The method ExecPodContainerAndGetOutput in signal.go can get stuck, since GetExecutorOutput blocks while waiting for the remote stdout output, causing the goroutine to block for a long time or even forever. We should add a timeout to prevent this problem from happening again.
This blocks all four cleanup goroutines (the default), so the pod_clean_up queue depth grows indefinitely.
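A hedged sketch of the caller-side guard being proposed, with a hypothetical signalWithTimeout helper standing in for the real signaling call, so a cleanup worker stops waiting after a deadline even if the underlying exec never returns:

```go
package executil

import (
	"fmt"
	"time"
)

// signalWithTimeout is a hypothetical wrapper: it runs the blocking signal
// call in its own goroutine and stops waiting after the timeout, so the
// cleanup worker is freed even if the remote stream never returns.
func signalWithTimeout(signalContainer func() error, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- signalContainer() }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("timed out after %s waiting for the exec to return", timeout)
	}
}
```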
Version
main
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
none
Logs from the workflow controller
Logs from your workflow's wait container