
WireGuard pod-to-pod-encryption tests are being run despite the cluster reporting encryption as disabled, causing a segfault #2262

Open
MTRNord opened this issue Jan 26, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments


MTRNord commented Jan 26, 2024

Bug report

The tests fail with the following segfault:

[=] Test [pod-to-pod-encryption] [38/63]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x2812648]

goroutine 29259 [running]:
github.com/cilium/cilium-cli/connectivity/tests.getFilter({0x4366ea0, 0xc000aaa080}, 0xc0024fe3c0, 0xc001ab1040, 0xc001ab10c0, 0xc001ab1000, 0xc001ab1080, 0x1, 0x0, 0x0)
	/cilium/connectivity/tests/encryption.go:171 +0x8e8
github.com/cilium/cilium-cli/connectivity/tests.testNoTrafficLeak({0x4366ea0?, 0xc000aaa080}, 0xc0024fe3c0, {0x434a4c8, 0xc0024df440}, 0xc00060fb70?, 0xc001ab1040, 0xc000665b90?, 0x22?, 0x0, ...)
	/cilium/connectivity/tests/encryption.go:381 +0x1dd
github.com/cilium/cilium-cli/connectivity/tests.(*podToPodEncryption).Run.func1(0x2ef1b00?)
	/cilium/connectivity/tests/encryption.go:263 +0x65
github.com/cilium/cilium-cli/connectivity/check.(*Test).ForEachIPFamily(0xc0024fe3c0, 0xc012ad3ce0)
	/cilium/connectivity/check/test.go:808 +0x28e
github.com/cilium/cilium-cli/connectivity/tests.(*podToPodEncryption).Run(0xc0024df440, {0x4366ea0?, 0xc000aaa080}, 0xc0024fe3c0)
	/cilium/connectivity/tests/encryption.go:262 +0x5da
github.com/cilium/cilium-cli/connectivity/check.(*Test).Run(0xc0024fe3c0, {0x4366ea0, 0xc000aaa080}, 0x1b6c225?)
	/cilium/connectivity/check/test.go:329 +0x5fb
github.com/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run.func1()
	/cilium/connectivity/check/context.go:405 +0x8c
created by github.com/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run in goroutine 52
	/cilium/connectivity/check/context.go:402 +0x266

This is "expected" as the precondition needed (aka the pod running which it tries to access) is not met. However it feels even then weird that this is a segfault rather than an error. Though it shouldnt have went into this in the first place.

General Information

  • Cilium CLI version (run cilium version)
cilium-cli: v0.15.20 compiled with go1.21.6 on linux/amd64
cilium image (default): v1.14.5
cilium image (stable): v1.14.6
cilium image (running): 1.14.6
  • Orchestration system version in use (e.g. kubectl version, ...)
Client Version: v1.28.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2
  • Platform / infrastructure information (e.g. AWS / Azure / GCP, image / kernel versions)

Bare-metal kubeadm cluster. The control-plane node runs Gentoo with a 6.1.60 kernel; the two workers run NixOS on kernel 6.7.1.

The control plane is x86 and the two workers are arm64. All three nodes are allowed to run pods.

  • Link to relevant artifacts (policies, deployments scripts, ...)

A lot of additional information is in the Slack thread: https://cilium.slack.com/archives/C1MATJ5U5/p1706192594540579

  • Generate and upload a system zip: cilium sysdump

(Hosted via Matrix since it is 2 MB larger than what GitHub allows here :( )
https://matrix.org/_matrix/media/v3/download/midnightthoughts.space/64ef2c6b31d3c8edab052443335f220439e64fb51750678141078077440

How to reproduce the issue

This is rather unclear. However, here are some known hints:

The Helm values used for the deployment are:

---
bpf:
  hostLegacyRouting: false
  masquerade: true
cluster:
  # -- Name of the cluster. Only required for Cluster Mesh and mutual authentication with SPIRE.
  name: <redacted>
  # -- (int) Unique ID of the cluster. Must be unique across all connected
  # clusters and in the range of 1 to 255. Only required for Cluster Mesh,
  # may be 0 if Cluster Mesh is not used.
  id: 0
cni:
  customConf: false
  uninstall: false
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.245.0.0/16
    clusterPoolIPv6PodCIDRList:
      - fd00::/104
operator:
  unmanagedPodWatcher:
    restart: true
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  dashboards:
    enabled: true

policyEnforcementMode: default

kubeProxyReplacement: "true"

routingMode: tunnel
tunnelProtocol: vxlan
#tunnelProtocol: geneve
tunnel: vxlan
tunnelPort: 8473
sessionAffinity: true
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
dashboards:
  enabled: true
hubble:
  relay:
    enabled: true
    prometheus:
      enabled: true
  ui:
    enabled: true
    metrics:
      enabled:
        - dns
        - tcp
        - httpV2
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query;ignoreAAAA
      - drop
      - flow
      - flows-to-world
      - httpV2:exemplars=true;labelsContext=source_ip
#      - source_namespace
#      - source_workload
#      - destination_ip
#      - destination_namespace
#      - destination_workload
#      - traffic_direction
      - icmp
      - port-distribution
      - tcp
endpointStatus:
  enabled: true
  status: "policy"

nodePort:
  enabled: false

# Turn on after migration
l2announcements:
  enabled: true
k8sClientRateLimit:
  qps: 50
  burst: 100

k8sServiceHost: <redacted>
k8sServicePort: 6443

ipv6:
  enabled: true
rollOutCiliumPods: true

# Possibly broken
#enableIPv6Masquerade: false

#nat46x64Gateway:
#  enabled: true

At one point the cluster had WireGuard encryption between nodes enabled via Cilium. It did not work and was therefore rolled back on the control plane. Since the worker nodes were locked out, I removed them the normal kubeadm way and then re-added them under the same node names.

The Slack thread led me to look at https://github.com/cilium/cilium-cli/blob/v0.15.20/connectivity/check/features.go#L185, which presumably is the precondition that must be met for these tests to run. All three nodes, however, return:

  "encryption": {
    "mode": "Disabled"
  },

when cilium status -o json is run in the respective pods.
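
As an aside, here is a small sketch of how such a check could be driven from the status output above; the JSON field names come from the snippet, while everything else is illustrative and not the actual feature-detection code in features.go:

// Hypothetical sketch: gate the encryption tests on the "encryption.mode"
// field reported by `cilium status -o json`.
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

type ciliumStatus struct {
	Encryption struct {
		Mode string `json:"mode"`
	} `json:"encryption"`
}

// encryptionEnabled reports whether the node advertises an active
// encryption mode (anything other than "Disabled").
func encryptionEnabled(raw []byte) (bool, error) {
	var s ciliumStatus
	if err := json.Unmarshal(raw, &s); err != nil {
		return false, err
	}
	return s.Encryption.Mode != "" && s.Encryption.Mode != "Disabled", nil
}

func main() {
	// Trimmed example of the status output quoted above.
	raw := []byte(`{"encryption": {"mode": "Disabled"}}`)

	enabled, err := encryptionEnabled(raw)
	if err != nil {
		log.Fatal(err)
	}
	if !enabled {
		fmt.Println("encryption reported as Disabled; pod-to-pod-encryption tests should be skipped")
		return
	}
	fmt.Println("encryption enabled; tests may run")
}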

This is the state I am currently at.

MTRNord added the kind/bug (Something isn't working) label on Jan 26, 2024
MTRNord commented Jan 26, 2024

I believe I figured out what is going on. The worker nodes have an arm taint, which the test DaemonSet does not tolerate, so only one of the three pods is started. This leaves "serverHost" empty, which then presumably causes the segfault.

MTRNord commented Jan 26, 2024

OK, I confirmed it: the segfault is caused by the DaemonSet not tolerating the taints. I will leave this open, though, as I believe this should be a test failure rather than a segfault :)
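
For what it's worth, here is a rough sketch of a pre-flight check that would surface this situation explicitly; the cilium-test namespace and host-netns DaemonSet name are assumptions for illustration, not confirmed cilium-cli internals:

// Hypothetical pre-flight check using client-go: refuse to run the
// connectivity tests while the test DaemonSet is not fully scheduled.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed namespace/name for the test workload; adjust as needed.
	ds, err := client.AppsV1().DaemonSets("cilium-test").
		Get(context.Background(), "host-netns", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
		// A clear failure here is preferable to the nil-pointer panic above.
		log.Fatalf("DaemonSet %s/%s only has %d/%d pods ready; check node taints and tolerations",
			ds.Namespace, ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	}
	fmt.Println("all test pods scheduled; pod-to-pod-encryption test can run")
}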
