
WireGuard pod-to-pod-encryption tests are being run despite the cluster reporting encryption as disabled, causing a segfault #2262

Open
MTRNord opened this issue Jan 26, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments


MTRNord commented Jan 26, 2024

Bug report

The tests fail with the following segfault:

[=] Test [pod-to-pod-encryption] [38/63]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x2812648]

goroutine 29259 [running]:
github.com/cilium/cilium-cli/connectivity/tests.getFilter({0x4366ea0, 0xc000aaa080}, 0xc0024fe3c0, 0xc001ab1040, 0xc001ab10c0, 0xc001ab1000, 0xc001ab1080, 0x1, 0x0, 0x0)
	/cilium/connectivity/tests/encryption.go:171 +0x8e8
github.com/cilium/cilium-cli/connectivity/tests.testNoTrafficLeak({0x4366ea0?, 0xc000aaa080}, 0xc0024fe3c0, {0x434a4c8, 0xc0024df440}, 0xc00060fb70?, 0xc001ab1040, 0xc000665b90?, 0x22?, 0x0, ...)
	/cilium/connectivity/tests/encryption.go:381 +0x1dd
github.com/cilium/cilium-cli/connectivity/tests.(*podToPodEncryption).Run.func1(0x2ef1b00?)
	/cilium/connectivity/tests/encryption.go:263 +0x65
github.com/cilium/cilium-cli/connectivity/check.(*Test).ForEachIPFamily(0xc0024fe3c0, 0xc012ad3ce0)
	/cilium/connectivity/check/test.go:808 +0x28e
github.com/cilium/cilium-cli/connectivity/tests.(*podToPodEncryption).Run(0xc0024df440, {0x4366ea0?, 0xc000aaa080}, 0xc0024fe3c0)
	/cilium/connectivity/tests/encryption.go:262 +0x5da
github.com/cilium/cilium-cli/connectivity/check.(*Test).Run(0xc0024fe3c0, {0x4366ea0, 0xc000aaa080}, 0x1b6c225?)
	/cilium/connectivity/check/test.go:329 +0x5fb
github.com/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run.func1()
	/cilium/connectivity/check/context.go:405 +0x8c
created by github.com/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run in goroutine 52
	/cilium/connectivity/check/context.go:402 +0x266

This is "expected" as the precondition needed (aka the pod running which it tries to access) is not met. However it feels even then weird that this is a segfault rather than an error. Though it shouldnt have went into this in the first place.

General Information

  • Cilium CLI version (run cilium version)
cilium-cli: v0.15.20 compiled with go1.21.6 on linux/amd64
cilium image (default): v1.14.5
cilium image (stable): v1.14.6
cilium image (running): 1.14.6
  • Orchestration system version in use (e.g. kubectl version, ...)
Client Version: v1.28.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2
  • Platform / infrastructure information (e.g. AWS / Azure / GCP, image / kernel versions)

Bare-metal kubeadm cluster. The control-plane node runs Gentoo with a 6.1.60 kernel; the two workers run NixOS on kernel 6.7.1.

The control plane is x86 and the two workers are arm64. All three nodes are allowed to run pods.

  • Link to relevant artifacts (policies, deployments scripts, ...)

A lot of additional information is in the Slack thread: https://cilium.slack.com/archives/C1MATJ5U5/p1706192594540579

  • Generate and upload a system zip: cilium sysdump

(Hosted via Matrix since it is 2 MB larger than what GitHub allows here :( )
https://matrix.org/_matrix/media/v3/download/midnightthoughts.space/64ef2c6b31d3c8edab052443335f220439e64fb51750678141078077440

How to reproduce the issue

This is rather unclear. However, here are some known hints:

The Helm values used for the deployment are:

---
bpf:
  hostLegacyRouting: false
  masquerade: true
cluster:
  # -- Name of the cluster. Only required for Cluster Mesh and mutual authentication with SPIRE.
  name: <redacted>
  # -- (int) Unique ID of the cluster. Must be unique across all connected
  # clusters and in the range of 1 to 255. Only required for Cluster Mesh,
  # may be 0 if Cluster Mesh is not used.
  id: 0
cni:
  customConf: false
  uninstall: false
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.245.0.0/16
    clusterPoolIPv6PodCIDRList:
      - fd00::/104
operator:
  unmanagedPodWatcher:
    restart: true
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  dashboards:
    enabled: true

policyEnforcementMode: default

kubeProxyReplacement: "true"

routingMode: tunnel
tunnelProtocol: vxlan
#tunnelProtocol: geneve
tunnel: vxlan
tunnelPort: 8473
sessionAffinity: true
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
dashboards:
  enabled: true
hubble:
  relay:
    enabled: true
    prometheus:
      enabled: true
  ui:
    enabled: true
    metrics:
      enabled:
        - dns
        - tcp
        - httpV2
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query;ignoreAAAA
      - drop
      - flow
      - flows-to-world
      - httpV2:exemplars=true;labelsContext=source_ip
#      - source_namespace
#      - source_workload
#      - destination_ip
#      - destination_namespace
#      - destination_workload
#      - traffic_direction
      - icmp
      - port-distribution
      - tcp
endpointStatus:
  enabled: true
  status: "policy"

nodePort:
  enabled: false

# Turn on after migration
l2announcements:
  enabled: true
k8sClientRateLimit:
  qps: 50
  burst: 100

k8sServiceHost: <redacted>
k8sServicePort: 6443

ipv6:
  enabled: true
rollOutCiliumPods: true

# Possibly broken
#enableIPv6Masquerade: false

#nat46x64Gateway:
#  enabled: true

At one point the cluster had WireGuard encryption between nodes enabled via Cilium. It did not work and was therefore rolled back on the control plane. Since the worker nodes were locked out, I removed them the normal kubeadm way and then re-added them under the same node names.

The Slack thread led me to look at https://github.com/cilium/cilium-cli/blob/v0.15.20/connectivity/check/features.go#L185, which presumably is the precondition that must be met for these tests to run. All three nodes, however, return:

  "encryption": {
    "mode": "Disabled"
  },

when cilium status -o json is run in the respective pods.
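
As an aside, here is a small sketch of how such a check could be driven from the status output above; the JSON field names come from the snippet, while everything else is illustrative and not the actual feature-detection code in features.go:

// Hypothetical sketch: gate the encryption tests on the "encryption.mode"
// field reported by `cilium status -o json`.
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

type ciliumStatus struct {
	Encryption struct {
		Mode string `json:"mode"`
	} `json:"encryption"`
}

// encryptionEnabled reports whether the node advertises an active
// encryption mode (anything other than "Disabled").
func encryptionEnabled(raw []byte) (bool, error) {
	var s ciliumStatus
	if err := json.Unmarshal(raw, &s); err != nil {
		return false, err
	}
	return s.Encryption.Mode != "" && s.Encryption.Mode != "Disabled", nil
}

func main() {
	// Trimmed example of the status output quoted above.
	raw := []byte(`{"encryption": {"mode": "Disabled"}}`)

	enabled, err := encryptionEnabled(raw)
	if err != nil {
		log.Fatal(err)
	}
	if !enabled {
		fmt.Println("encryption reported as Disabled; pod-to-pod-encryption tests should be skipped")
		return
	}
	fmt.Println("encryption enabled; tests may run")
}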

This is the state I am currently at.

MTRNord added the kind/bug (Something isn't working) label on Jan 26, 2024
MTRNord commented Jan 26, 2024

I believe I figured out what is going on. The worker nodes have an arm taint, which the test DaemonSet does not tolerate, so only one of the three pods is started. This leaves "serverHost" empty, which then presumably causes the segfault.

MTRNord commented Jan 26, 2024

OK, I confirmed it: the segfault is caused by the DaemonSet not tolerating the taints. I will leave this open, though, as I believe this should be a test failure rather than a segfault :)
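
For what it's worth, here is a rough sketch of a pre-flight check that would surface this situation explicitly; the cilium-test namespace and host-netns DaemonSet name are assumptions for illustration, not confirmed cilium-cli internals:

// Hypothetical pre-flight check using client-go: refuse to run the
// connectivity tests while the test DaemonSet is not fully scheduled.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed namespace/name for the test workload; adjust as needed.
	ds, err := client.AppsV1().DaemonSets("cilium-test").
		Get(context.Background(), "host-netns", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
		// A clear failure here is preferable to the nil-pointer panic above.
		log.Fatalf("DaemonSet %s/%s only has %d/%d pods ready; check node taints and tolerations",
			ds.Namespace, ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	}
	fmt.Println("all test pods scheduled; pod-to-pod-encryption test can run")
}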
