Improve Reliability and Error Handling of remotecommand #2497

Open
learnitall opened this issue Apr 22, 2024 · 0 comments
Labels
area/CI Continuous Integration testing issue or flake kind/bug Something isn't working

Comments

@learnitall
Contributor

The cilium-cli relies on the remotecommand library to interact with Cilium Agents running inside a cluster. As a result, the cilium-cli connectivity tests essentially perform a stress test on the remotecommand library and streaming functionality of the Kubernetes API Server, which has helped identify race conditions upstream. @bimmlerd did a fantastic analysis and opened the following PRs:

While these are being worked on, it makes sense to change the cilium-cli so that its use of remotecommand is more robust.

This issue tracks defining and implementing a protocol (i.e., a set of rules) that the cilium-cli will use to detect unexpected errors or race conditions when using the remote executor. To start, kubernetes/kubernetes#124335 can be addressed by something like the following:

  1. Before executing the remote command, generate a random integer.
  2. Wrap each command in a small script that runs the given command and then echoes the random integer on a new line to both stdout and stderr.
  3. Assert that the full integer is received on both stdout and stderr. If it was not received in full, some kind of race condition or flake occurred, and we can then decide whether to retry.
  4. Strip the integer from stdout and stderr, then return the output to the user.

Additionally, we want to put the transition from the SPDYExecutor to the WebSocketExecutor on our radar. The WebSocketExecutor is described in KEP-4006. Some work has landed in v0.30.0 and v0.29.4 of k8s.io/client-go.

@learnitall learnitall added kind/bug Something isn't working area/CI Continuous Integration testing issue or flake labels Apr 22, 2024