Possible circular dependency in ACL management #3738

Open
schoeppi5 opened this issue Mar 13, 2024 · 1 comment
Labels
type/bug Something isn't working

Comments

schoeppi5 commented Mar 13, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

I ran into several issues when installing a new Consul cluster on Kubernetes using the Helm chart. When configuring the Helm chart to manage ACLs for the cluster, I found a circular dependency, which prevented me from installing the cluster without modifying the manifests using kustomize.

Reproduction Steps

In order to effectively and quickly resolve the issue, please provide exact steps that allow us to reproduce the problem. If no steps are provided, it will likely take longer to get the issue resolved. An example that you can follow is provided below.

Steps to reproduce:

Install the helm chart using the following values:

global:
  name: consul
  secretsBackend:
    vault:
      enabled: false
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: bootstrap-token-secret
      secretKey: bootstrap-token
server:
  replicas: 3
  connect: false
  exposeService:
    enabled: false
ui:
  enabled: true
  service:
    enabled: true
  ingress:
    enabled: false
syncCatalog:
  enabled: false
connectInject:
  enabled: false
meshGateway:
  enabled: false
ingressGateways:
  enabled: false
terminatingGateways:
  enabled: false
tests:
  enabled: false
telemetryCollector:
  enabled: false
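
A minimal sketch of the install, assuming the values above are saved as values.yaml; the release name and namespace are placeholders:

helm repo add hashicorp https://helm.releases.hashicorp.com
helm install consul hashicorp/consul --namespace consul --create-namespace --values values.yaml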

Actual behavior

This will lead to two things:

  1. The server's StatefulSet will try to load the bootstrap-token from the secret (see the sketch after this list).
  • If the secret does not exist (which it doesn't on a clean install), this prevents the server Pods from starting.
  2. The consul-k8s-control-plane server-acl-init job will not complete, because it cannot resolve the DNS name of the headless server service while no server Pods are running.
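
For illustration only, the bootstrap token ends up being referenced from the server Pod spec roughly like this; the env var name below is a placeholder rather than the chart's actual template, so the rendered StatefulSet is authoritative:

env:
- name: ACL_BOOTSTRAP_TOKEN        # placeholder name, for illustration only
  valueFrom:
    secretKeyRef:
      name: bootstrap-token-secret # .global.acls.bootstrapToken.secretName
      key: bootstrap-token         # .global.acls.bootstrapToken.secretKey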

I realized that the .global.acls.bootstrapToken config option is probably meant to be set if you already have a bootstrap token. If that is the case, it should be made clearer in the docs.
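
If the token already exists, pre-creating the secret before the install would look roughly like this (the namespace and the token value are placeholders):

kubectl --namespace consul create secret generic bootstrap-token-secret \
  --from-literal=bootstrap-token=<existing-acl-bootstrap-token>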

One possible workaround I tried was creating an empty secret with the same name. This allows the Consul servers to start and the server-acl-init job to initialize the ACL system, but the job ultimately fails because it always tries to create the secret rather than update it, contradicting the documentation.

The other possibility is to remove .global.acls.bootstrapToken, which removes the env var from the StatefulSet. This works: the job initializes the ACL system and creates the secret with the token. However, I ran into another issue with the server-acl-init job:
The job resolves the IPs of the Consul servers from the DNS name of the headless service, which introduces a race condition, because DNS only returns IP addresses for Pods that have already started. In my testing, the job almost always received only one or two of the three IP addresses, so only those servers received a server token, leaving the remaining servers unable to communicate using ACLs. A fix would be for the job to know how many servers to expect (similar to -bootstrap-expect).
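
One way to observe this is to resolve the headless service while the servers are still coming up; addresses only appear for Pods that have started. The namespace and service name assume the values above:

kubectl --namespace consul run dns-check --rm -i --restart=Never --image=busybox:1.36 -- nslookup consul-server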

Workaround solution

I solved these issues with a Kustomize patch to the ACL init job:

apiVersion: batch/v1
kind: Job
metadata:
  name: kustomize
spec:
  template:
    spec:
      containers:
      - name: server-acl-init-job
        env: 
        - name: CONSUL_ADDRESSES
          value: "exec=/nslookup.sh consul-server"
        - name: CONSUL_TLS_SERVER_NAME
          value: "consul-server"
        volumeMounts:
        - name: nslookup
          mountPath: /nslookup.sh
          subPath: nslookup.sh
      volumes:
      - name: nslookup
        configMap:
          name: consul-nslookup-exec
          items:
          - key: nslookup.sh
            path: nslookup.sh
          defaultMode: 0777

and the nslookup.sh script:

#!/bin/sh

# Resolve the given name and print every resolved address except the first
# "Address:" line (the DNS server itself), space separated. Fail unless
# exactly three addresses are returned, matching the three server replicas.
nslookup "$1" | awk 'BEGIN {count=0} /^Address:/ {count++; if (count > 1) printf "%s ", $2} END {if (count-1 != 3) {printf "Expected three addresses for consul, found %s", count-1; exit 1}}'
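
For completeness, a sketch of the kustomization.yaml that could wire this together; the file names, the rendered chart manifest, and the Job name (consul-server-acl-init) are assumptions about this particular setup. As I understand it, the exec= form of CONSUL_ADDRESSES makes the job run the script and use its stdout as the list of server addresses.

resources:
- consul-rendered.yaml            # output of helm template for the chart above (assumed file name)
configMapGenerator:
- name: consul-nslookup-exec
  files:
  - nslookup.sh
  options:
    disableNameSuffixHash: true   # the patch references the plain ConfigMap name
patches:
- path: server-acl-init-patch.yaml
  target:
    kind: Job
    name: consul-server-acl-init  # assumed Job name rendered by the chart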

Environment details

If not already included, please provide the following:

  • consul-k8s version: 1.4.0

Additionally, please provide details regarding the Kubernetes Infrastructure, as shown below:

  • Kubernetes version: v1.29 (eks)
  • Cloud Provider: EKS
  • Networking CNI plugin in use: AWS-VPC-CNI

Philipp Schöppner <philipp.schoeppner@mercedes-benz.com>, Mercedes-Benz Tech Innovation GmbH (Provider Information)

schoeppi5 added the type/bug (Something isn't working) label Mar 13, 2024
@schreibergeorg

👍
