Capture additional clustermesh-related troubleshooting information as part of sysdumps #2531

Merged: 7 commits, May 15, 2024

@giorio94 giorio94 commented May 6, 2024

Introduce the collection of additional clustermesh-related information as part of sysdumps, and in particular:

  • The output of the newly introduced clustermesh-apiserver commands:
    • clustermesh-apiserver version (apiserver container)
    • clustermesh-apiserver clustermesh-dbg troubleshoot (apiserver container)
    • clustermesh-apiserver version (kvstoremesh container)
    • clustermesh-apiserver kvstoremesh-dbg status --verbose (kvstoremesh container)
    • clustermesh-apiserver kvstoremesh-dbg status -o json (kvstoremesh container)
    • clustermesh-apiserver kvstoremesh-dbg troubleshoot --include-local (kvstoremesh container)
  • Gops stats and profiling data for the apiserver and kvstoremesh clustermesh-apiserver containers.

Related: cilium/cilium#32165, cilium/cilium#32156, cilium/cilium#32336

Please review commit by commit, and refer to the respective messages for additional information.

Unify the file names used for operator and clustermesh-apiserver
metrics, so that they are grouped together when the files are listed
alphabetically. The format matches the one used for gops stats.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The same function is used to collect operator metrics as well, hence
let's generalize the error message to not reference clustermesh. The
information about the failing test is still present as part of the
warning message output by the sysdump collection logic.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The clustermesh-apiserver was recently extended with a set of debug
commands that output the binary version, the status of kvstoremesh
connectivity to the remote clusters, and etcd connectivity
troubleshooting information. Let's additionally collect all this
information as part of the sysdump. Depending on the Cilium version,
some of the commands may not exist, in which case the corresponding
data will not be collected.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
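
The tolerance for missing commands described above could be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the `debugCommand` type, `execFunc`, `collectDebugOutput` and the fake `oldVersionExec` are hypothetical names, and which commands are absent on older versions is assumed purely for the demo. Only the command lines themselves come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// debugCommand pairs a container name with the command to run inside it.
type debugCommand struct {
	Container string
	Args      []string
}

// clustermeshDebugCommands mirrors the list in the pull request description.
var clustermeshDebugCommands = []debugCommand{
	{"apiserver", []string{"clustermesh-apiserver", "version"}},
	{"apiserver", []string{"clustermesh-apiserver", "clustermesh-dbg", "troubleshoot"}},
	{"kvstoremesh", []string{"clustermesh-apiserver", "version"}},
	{"kvstoremesh", []string{"clustermesh-apiserver", "kvstoremesh-dbg", "status", "--verbose"}},
	{"kvstoremesh", []string{"clustermesh-apiserver", "kvstoremesh-dbg", "status", "-o", "json"}},
	{"kvstoremesh", []string{"clustermesh-apiserver", "kvstoremesh-dbg", "troubleshoot", "--include-local"}},
}

// execFunc is a hypothetical stand-in for "run this command in that container".
type execFunc func(container string, args []string) ([]byte, error)

// collectDebugOutput runs every command, keeps the output of those that
// succeed, and records (rather than aborts on) those that fail.
func collectDebugOutput(exec execFunc, cmds []debugCommand) (map[string][]byte, []error) {
	outputs := make(map[string][]byte)
	var skipped []error
	for _, c := range cmds {
		key := c.Container + ": " + strings.Join(c.Args, " ")
		out, err := exec(c.Container, c.Args)
		if err != nil {
			skipped = append(skipped, fmt.Errorf("%s: %w", key, err))
			continue
		}
		outputs[key] = out
	}
	return outputs, skipped
}

// oldVersionExec is a fake executor for the demo only: it pretends that the
// "troubleshoot" subcommands do not exist in the target binary.
func oldVersionExec(_ string, args []string) ([]byte, error) {
	for _, a := range args {
		if a == "troubleshoot" {
			return nil, errors.New("unknown command")
		}
	}
	return []byte("ok"), nil
}

func main() {
	outputs, skipped := collectDebugOutput(oldVersionExec, clustermeshDebugCommands)
	fmt.Printf("collected %d outputs, skipped %d commands\n", len(outputs), len(skipped))
}
```

Recording failures per command, instead of returning on the first error, is what lets the remaining data still end up in the sysdump.
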
Let's collect the gops stats for both the clustermesh and kvstoremesh
containers of the clustermesh-apiserver, as they are useful for
investigating possible deadlocks and memory leaks.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Do not attempt to collect metrics and gops stats if the target pod is
not running, or the specified container does not exist. This avoids
performing operations that are guaranteed to fail, and outputting
possibly misleading errors. One example is when the
clustermesh-apiserver is enabled, but kvstoremesh is not, as all
operations targeting the latter will fail.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
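
The guard described above might look roughly like the sketch below. It uses a simplified, hypothetical `podInfo` type so that the example stays self-contained; real code would presumably inspect the Kubernetes pod object (its phase and container list) instead. The pod name is a placeholder.

```go
package main

import "fmt"

// podInfo is a simplified, hypothetical view of a pod: just the fields the
// guard needs.
type podInfo struct {
	Name       string
	Phase      string   // e.g. "Pending", "Running"
	Containers []string // names of the containers in the pod spec
}

// shouldCollect reports whether it makes sense to target the given container:
// the pod must be running and the container must exist.
func shouldCollect(p podInfo, container string) bool {
	if p.Phase != "Running" {
		return false
	}
	for _, c := range p.Containers {
		if c == container {
			return true
		}
	}
	return false
}

func main() {
	// clustermesh-apiserver enabled, kvstoremesh disabled.
	p := podInfo{
		Name:       "clustermesh-apiserver-example",
		Phase:      "Running",
		Containers: []string{"apiserver"},
	}
	for _, c := range []string{"apiserver", "kvstoremesh"} {
		fmt.Printf("%s: collect=%v\n", c, shouldCollect(p, c))
	}
}
```
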
Extend the Client implementation by introducing a new method that
allows proxying a connection to a TCP port inside a container. This
mimics the behavior of the port-forward implementation, but it directly
provides access to the forwarded stream (through a ReadWriteCloser
interface), rather than exposing it through a local port.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's additionally collect gops profiling data for both the clustermesh
and kvstoremesh containers of the clustermesh-apiserver, to enable
investigating performance issues. Profiling data is collected by
proxying a connection to the remote gops server, to avoid relying on
the gops client. Indeed, the client would save the result to a
temporary directory, and we cannot easily retrieve it given that the
target is a distroless image lacking shell tools.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 marked this pull request as ready for review May 9, 2024 08:05
@giorio94 giorio94 requested review from a team as code owners May 9, 2024 08:05
@giorio94 giorio94 requested review from nathanjsweet, a team and YutaroHayakawa and removed request for a team May 9, 2024 08:05
@YutaroHayakawa YutaroHayakawa left a comment

Sorry for the delay. Looks great!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 13, 2024
@michi-covalent michi-covalent merged commit 1c7d1ac into cilium:main May 15, 2024
12 of 13 checks passed
Labels: area/clustermesh, area/sysdump, ready-to-merge
4 participants