
Provisioning fails with RKE 1.5.9 when extra_env is used for the kube-api service #3587

Open
tmsdce opened this issue May 15, 2024 · 3 comments

tmsdce commented May 15, 2024

Description

I'm trying to create a cluster with extra_env specified for the kube-api service. This works fine with RKE 1.5.8 but fails with RKE 1.5.9. I think the following commit might be involved: 2e767c8

RKE version: 1.5.9

Docker version: (docker version, docker info preferred)

Client: Docker Engine - Community
 Version:    25.0.5
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 3
  Running: 2
  Paused: 0
  Stopped: 1
 Images: 10
 Server Version: 25.0.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-35-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 5.778GiB
 Name: ubu22
 ID: 7bc42587-26ab-4f66-83ec-56004d44d654
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 33
  Goroutines: 53
  System Time: 2024-05-15T15:03:08.602809828Z
  EventsListeners: 0
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
6.5.0-35-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

VirtualBox, but it also fails on vSphere (where the RKE Terraform provider is used)

cluster.yml file:

kubernetes_version: v1.28.8-rancher1-1
authentication:
  sans:
  - ubu22
  - 192.168.56.20
cloud_provider:
  name: ""
cluster_name: ubu22-k8s
enable_cri_dockerd: true
ingress:
  provider: none
network:
  plugin: calico
  options:
    calico_cloud_provider: none
    calico_flex_volume_plugin_dir: /var/lib/kubelet/volumeplugins
nodes:
- address: 192.168.56.20
  hostname_override: ubu22
  internal_address: 192.168.56.20
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: "/home/tsde/.ssh/id_ed25519"
  user: tsde
services:
  kube-api:
    extra_env:
    - TEST_VAR=test
  kube-controller:
    cluster_cidr: 10.42.0.0/16
    extra_args:
      cluster-signing-cert-file: /etc/kubernetes/ssl/kube-ca.pem
      cluster-signing-key-file: /etc/kubernetes/ssl/kube-ca-key.pem
      flex-volume-plugin-dir: /var/lib/kubelet/volumeplugins
      terminated-pod-gc-threshold: "15"
upgrade_strategy:
  drain: true
  max_unavailable_controlplane: "1"
  max_unavailable_worker: "1"
  node_drain_input:
    delete_local_data: true
    force: true
    grace_period: 0
    ignore_daemonsets: true
    timeout: 0

Steps to Reproduce:

  • Run rke up and wait for it to fail when it tries to start the kube-apiserver container (see logs below)

Results:

INFO[0000] Running RKE version: v1.5.9                  
INFO[0000] Initiating Kubernetes cluster                
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates 
INFO[0000] [certificates] Generating admin certificates and kubeconfig 
INFO[0000] Successfully Deployed state file at [./cluster.rkestate] 
INFO[0000] Building Kubernetes cluster                  
INFO[0000] [dialer] Setup tunnel for host [192.168.56.20] 
INFO[0000] [network] Deploying port listener containers 
INFO[0000] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0001] Starting container [rke-etcd-port-listener] on host [192.168.56.20], try #1 
INFO[0002] [network] Successfully started [rke-etcd-port-listener] container on host [192.168.56.20] 
INFO[0002] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0002] Starting container [rke-cp-port-listener] on host [192.168.56.20], try #1 
INFO[0002] [network] Successfully started [rke-cp-port-listener] container on host [192.168.56.20] 
INFO[0002] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0003] Starting container [rke-worker-port-listener] on host [192.168.56.20], try #1 
INFO[0003] [network] Successfully started [rke-worker-port-listener] container on host [192.168.56.20] 
INFO[0003] [network] Port listener containers deployed successfully 
INFO[0003] [network] Running control plane -> etcd port checks 
INFO[0003] [network] Checking if host [192.168.56.20] can connect to host(s) [192.168.56.20] on port(s) [2379], try #1 
INFO[0003] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0004] Starting container [rke-port-checker] on host [192.168.56.20], try #1 
INFO[0004] [network] Successfully started [rke-port-checker] container on host [192.168.56.20] 
INFO[0004] Removing container [rke-port-checker] on host [192.168.56.20], try #1 
INFO[0004] [network] Running control plane -> worker port checks 
INFO[0004] [network] Checking if host [192.168.56.20] can connect to host(s) [192.168.56.20] on port(s) [10250], try #1 
INFO[0004] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0004] Starting container [rke-port-checker] on host [192.168.56.20], try #1 
INFO[0005] [network] Successfully started [rke-port-checker] container on host [192.168.56.20] 
INFO[0005] Removing container [rke-port-checker] on host [192.168.56.20], try #1 
INFO[0005] [network] Running workers -> control plane port checks 
INFO[0005] [network] Checking if host [192.168.56.20] can connect to host(s) [192.168.56.20] on port(s) [6443], try #1 
INFO[0005] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0005] Starting container [rke-port-checker] on host [192.168.56.20], try #1 
INFO[0005] [network] Successfully started [rke-port-checker] container on host [192.168.56.20] 
INFO[0006] Removing container [rke-port-checker] on host [192.168.56.20], try #1 
INFO[0006] [network] Checking KubeAPI port Control Plane hosts 
INFO[0006] [network] Removing port listener containers  
INFO[0006] Removing container [rke-etcd-port-listener] on host [192.168.56.20], try #1 
INFO[0007] [remove/rke-etcd-port-listener] Successfully removed container on host [192.168.56.20] 
INFO[0007] Removing container [rke-cp-port-listener] on host [192.168.56.20], try #1 
INFO[0007] [remove/rke-cp-port-listener] Successfully removed container on host [192.168.56.20] 
INFO[0007] Removing container [rke-worker-port-listener] on host [192.168.56.20], try #1 
INFO[0008] [remove/rke-worker-port-listener] Successfully removed container on host [192.168.56.20] 
INFO[0008] [network] Port listener containers removed successfully 
INFO[0008] [certificates] Deploying kubernetes certificates to Cluster nodes 
INFO[0008] Finding container [cert-deployer] on host [192.168.56.20], try #1 
INFO[0008] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0008] Starting container [cert-deployer] on host [192.168.56.20], try #1 
INFO[0008] Finding container [cert-deployer] on host [192.168.56.20], try #1 
INFO[0013] Finding container [cert-deployer] on host [192.168.56.20], try #1 
INFO[0013] Removing container [cert-deployer] on host [192.168.56.20], try #1 
INFO[0013] [reconcile] Rebuilding and updating local kube config 
INFO[0013] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml] 
WARN[0013] [reconcile] host [192.168.56.20] is a control plane node without reachable Kubernetes API endpoint in the cluster 
WARN[0013] [reconcile] no control plane node with reachable Kubernetes API endpoint in the cluster found 
INFO[0013] [certificates] Successfully deployed kubernetes certificates to Cluster nodes 
INFO[0013] [file-deploy] Deploying file [/etc/kubernetes/admission.yaml] to node [192.168.56.20] 
INFO[0013] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0014] Starting container [file-deployer] on host [192.168.56.20], try #1 
INFO[0014] Successfully started [file-deployer] container on host [192.168.56.20] 
INFO[0014] Waiting for [file-deployer] container to exit on host [192.168.56.20] 
INFO[0014] Waiting for [file-deployer] container to exit on host [192.168.56.20] 
INFO[0014] Container [file-deployer] is still running on host [192.168.56.20]: stderr: [], stdout: [] 
INFO[0015] Removing container [file-deployer] on host [192.168.56.20], try #1 
INFO[0015] [remove/file-deployer] Successfully removed container on host [192.168.56.20] 
INFO[0015] [/etc/kubernetes/admission.yaml] Successfully deployed admission control config to Cluster control nodes 
INFO[0015] [file-deploy] Deploying file [/etc/kubernetes/audit-policy.yaml] to node [192.168.56.20] 
INFO[0015] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0016] Starting container [file-deployer] on host [192.168.56.20], try #1 
INFO[0016] Successfully started [file-deployer] container on host [192.168.56.20] 
INFO[0016] Waiting for [file-deployer] container to exit on host [192.168.56.20] 
INFO[0016] Waiting for [file-deployer] container to exit on host [192.168.56.20] 
INFO[0017] Removing container [file-deployer] on host [192.168.56.20], try #1 
INFO[0017] [remove/file-deployer] Successfully removed container on host [192.168.56.20] 
INFO[0017] [/etc/kubernetes/audit-policy.yaml] Successfully deployed audit policy file to Cluster control nodes 
INFO[0017] [reconcile] Reconciling cluster state        
INFO[0017] [reconcile] This is newly generated cluster  
INFO[0017] Pre-pulling kubernetes images                
INFO[0017] Image [rancher/hyperkube:v1.28.8-rancher1] exists on host [192.168.56.20] 
INFO[0017] Image [rancher/mirrored-pause:3.7] exists on host [192.168.56.20] 
INFO[0017] Kubernetes images pulled successfully        
INFO[0017] [etcd] Building up etcd plane..              
INFO[0017] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0017] Starting container [etcd-fix-perm] on host [192.168.56.20], try #1 
INFO[0017] Successfully started [etcd-fix-perm] container on host [192.168.56.20] 
INFO[0017] Waiting for [etcd-fix-perm] container to exit on host [192.168.56.20] 
INFO[0017] Waiting for [etcd-fix-perm] container to exit on host [192.168.56.20] 
INFO[0018] Removing container [etcd-fix-perm] on host [192.168.56.20], try #1 
INFO[0018] [remove/etcd-fix-perm] Successfully removed container on host [192.168.56.20] 
INFO[0018] Image [rancher/mirrored-coreos-etcd:v3.5.10] exists on host [192.168.56.20] 
INFO[0018] Starting container [etcd] on host [192.168.56.20], try #1 
INFO[0018] [etcd] Successfully started [etcd] container on host [192.168.56.20] 
INFO[0018] [etcd] Running rolling snapshot container [etcd-rolling-snapshots] on host [192.168.56.20] 
INFO[0018] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0019] Starting container [etcd-rolling-snapshots] on host [192.168.56.20], try #1 
INFO[0019] [etcd] Successfully started [etcd-rolling-snapshots] container on host [192.168.56.20] 
INFO[0024] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0024] Starting container [rke-bundle-cert] on host [192.168.56.20], try #1 
INFO[0025] [certificates] Successfully started [rke-bundle-cert] container on host [192.168.56.20] 
INFO[0025] Waiting for [rke-bundle-cert] container to exit on host [192.168.56.20] 
INFO[0025] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [192.168.56.20] 
INFO[0025] Removing container [rke-bundle-cert] on host [192.168.56.20], try #1 
INFO[0025] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0026] Starting container [rke-log-linker] on host [192.168.56.20], try #1 
INFO[0026] [etcd] Successfully started [rke-log-linker] container on host [192.168.56.20] 
INFO[0026] Removing container [rke-log-linker] on host [192.168.56.20], try #1 
INFO[0026] [remove/rke-log-linker] Successfully removed container on host [192.168.56.20] 
INFO[0026] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0027] Starting container [rke-log-linker] on host [192.168.56.20], try #1 
INFO[0027] [etcd] Successfully started [rke-log-linker] container on host [192.168.56.20] 
INFO[0027] Removing container [rke-log-linker] on host [192.168.56.20], try #1 
INFO[0027] [remove/rke-log-linker] Successfully removed container on host [192.168.56.20] 
INFO[0027] [etcd] Successfully started etcd plane.. Checking etcd cluster health 
INFO[0028] [etcd] etcd host [192.168.56.20] reported healthy=true 
INFO[0028] [controlplane] Building up Controller Plane.. 
INFO[0028] Finding container [service-sidekick] on host [192.168.56.20], try #1 
INFO[0028] Image [rancher/rke-tools:v0.1.96] exists on host [192.168.56.20] 
INFO[0028] Image [rancher/hyperkube:v1.28.8-rancher1] exists on host [192.168.56.20] 
WARN[0028] Failed to create Docker container [kube-apiserver] on host [192.168.56.20]: Error response from daemon: invalid environment variable: 
WARN[0028] Failed to create Docker container [kube-apiserver] on host [192.168.56.20]: Error response from daemon: invalid environment variable: 
WARN[0028] Failed to create Docker container [kube-apiserver] on host [192.168.56.20]: Error response from daemon: invalid environment variable: 
FATA[0028] [controlPlane] Failed to bring up Control Plane: [Failed to create [kube-apiserver] container on host [192.168.56.20]: Failed to create Docker container [kube-apiserver] on host [192.168.56.20]: Error response from daemon: invalid environment variable:] 

tmsdce commented May 27, 2024

This bug prevents us from upgrading, as kube-apiserver has environment variables set (such as RKE_AUDITLOG_CONFIG_CHECKSUM).


tmsdce commented May 30, 2024

Hi @jiaqiluo, can you take a look at this issue? It seems related to 2e767c8.

jiaqiluo (Member) commented

Hi @tmsdce, thank you for reporting the issue. I can confirm that this is a bug.

Root Cause

This bug occurs in RKE v1.5.9.

If extra_env is set for kube-api in the cluster config file as shown below, the Env list in the container configuration for the kube-apiserver container will contain an empty string (""), which causes container creation to fail with the error invalid environment variable: returned by Docker.

services:
  kube-api:
    extra_env:
    - TEST_VAR=test
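
For context, the daemon-side rejection is easy to reproduce outside RKE. Below is a minimal sketch using the Docker Go SDK (the client library RKE builds on); the image is the one from the report, and the container name env-test is a placeholder. On a recent Docker daemon that validates Env entries, the create call should fail with the same error as the logs above.

package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// An Env slice containing an empty entry mimics what RKE v1.5.9
	// appears to build for kube-apiserver when extra_env is set.
	_, err = cli.ContainerCreate(context.Background(),
		&container.Config{
			Image: "rancher/hyperkube:v1.28.8-rancher1",
			Env:   []string{"TEST_VAR=test", ""}, // the trailing "" is rejected
		},
		nil, nil, nil, "env-test")
	fmt.Println(err)
	// Error response from daemon: invalid environment variable:
}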

Workaround

If your cluster is stuck in the failed state, removing the extra_env for kube-api from the cluster config and running rke up again will bring the cluster back to active.

If you need to set extra_env for kube-api, please stay on RKE v1.5.8 until the bug is fixed in an upcoming release. A sketch of the workaround config follows below.
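
For example, the services section of the cluster.yml above would look like this with the workaround applied (only the extra_env lines dropped, everything else unchanged):

services:
  kube-api: {}   # extra_env removed until the fix ships
  kube-controller:
    cluster_cidr: 10.42.0.0/16
    extra_args:
      cluster-signing-cert-file: /etc/kubernetes/ssl/kube-ca.pem
      cluster-signing-key-file: /etc/kubernetes/ssl/kube-ca-key.pem
      flex-volume-plugin-dir: /var/lib/kubelet/volumeplugins
      terminated-pod-gc-threshold: "15"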
