What happened?

When deploying Kubernetes using Kubespray with OpenStack as the external cloud provider, the cloud provider initialization fails with the following error:

```
W0212 09:05:21.997886 1 openstack.go:173] New openstack client created failed with config: Post "https://<redacted>:5000/v3/auth/tokens": dial tcp: lookup <redacted> on 10.233.0.3:53: write udp 10.233.0.3:48927->10.233.0.3:53: write: operation not permitted
F0212 09:05:21.998071 1 main.go:84] Cloud provider could not be initialized: could not init cloud provider "openstack": Post "https://<redacted>:5000/v3/auth/tokens": dial tcp: lookup <redacted> on 10.233.0.3:53: write udp 10.233.0.3:48927->10.233.0.3:53: write: operation not permitted
```
This points to a DNS resolution failure when the OpenStack cloud provider attempts to authenticate with the OpenStack API. The problem is linked to the DNS policy configuration introduced in commit c440106 (link to the commit) in the external-openstack-cloud-controller-manager-ds.yml.j2 template (direct link to the affected line). The CoreDNS pods cannot be scheduled while the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint is present, yet that taint is only lifted once the cloud controller has initialized the node; because the cloud controller's DNS lookups are directed at CoreDNS, initialization deadlocks.
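The chicken-and-egg situation can be seen in the pod spec: a hostNetwork pod whose dnsPolicy is ClusterFirstWithHostNet sends lookups to the in-cluster DNS service rather than the node's resolver. A minimal sketch of the relevant fields (illustrative only, not copied from the actual Kubespray template):

```yaml
# Illustrative excerpt of a cloud-controller-manager DaemonSet pod spec.
spec:
  hostNetwork: true
  # With this policy, lookups go to the cluster DNS service
  # (e.g. 10.233.0.3), which cannot start until the node taint is lifted.
  dnsPolicy: ClusterFirstWithHostNet
  tolerations:
    # The controller itself tolerates the taint, so it schedules;
    # CoreDNS typically does not, so it stays Pending.
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
```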
Notably, this DNS policy setting does not align with the default configurations provided by the official OpenStack cloud provider repository, both in the Helm chart (link to chart) and the plain manifests (link to plain manifest).
What did you expect to happen?
I expected the OpenStack cloud provider to initialize successfully without DNS resolution issues. The official configurations from the OpenStack cloud provider repository do not set a dnsPolicy, so the pods inherit DNS settings from the host, which avoids this initialization problem.
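The reason omitting the field helps is a Kubernetes defaulting rule: when a pod runs with hostNetwork: true and its dnsPolicy is left at the default ClusterFirst, the kubelet falls back to the Default policy, meaning the pod inherits the node's /etc/resolv.conf. A sketch of the effective behavior (illustrative, not the official manifest):

```yaml
# Illustrative pod spec with dnsPolicy omitted.
spec:
  hostNetwork: true
  # dnsPolicy defaults to ClusterFirst, but for hostNetwork pods the
  # kubelet treats that as Default, i.e. the node's resolv.conf, so
  # OpenStack API hostnames resolve even before CoreDNS is running.
```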
How can we reproduce it (as minimally and precisely as possible)?
1. Deploy a Kubernetes cluster using Kubespray with the OpenStack external cloud provider enabled.
2. Observe that the OpenStack cloud controller manager fails to start, with logs showing DNS resolution failures similar to the ones above.
3. Note the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint preventing CoreDNS from starting, which the cloud controller needs for DNS resolution.
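For reference, this taint is applied by the kubelet when it runs with --cloud-provider=external, and it is removed only after a cloud controller manager initializes the node. On the node object it looks roughly like:

```yaml
# Taint present on nodes awaiting cloud-provider initialization.
spec:
  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
```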
Anything else we need to know?

Removing the dnsPolicy parameter from the external-openstack-cloud-controller-manager-ds.yml.j2 template allows the OpenStack cloud controller pod to resolve DNS queries using the host's DNS settings. This change resolves the issue and allows the OpenStack cloud controller manager to start without errors.

It may be beneficial to align Kubespray's configuration with the official OpenStack cloud provider templates by not setting a dnsPolicy unless necessary, to prevent such issues from occurring in deployments.
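Assuming the template sets the policy on the DaemonSet pod spec, the proposed fix amounts to dropping a single line. A hypothetical excerpt (the exact surrounding lines and value in external-openstack-cloud-controller-manager-ds.yml.j2 may differ):

```diff
     spec:
       hostNetwork: true
-      dnsPolicy: ClusterFirstWithHostNet
       containers:
```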
can you also post the values for upstream_dns_servers and resolvconf_mode?
We have set resolvconf_mode: host_resolvconf in our cluster and configured additional upstream servers, and cluster provisioning works without any problems.
Maybe I can recreate your setup and find the cause.
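For anyone comparing setups, both knobs are Kubespray inventory variables; a hypothetical group_vars excerpt (server addresses are illustrative):

```yaml
# inventory/<cluster>/group_vars/all/all.yml (illustrative values)
resolvconf_mode: host_resolvconf
upstream_dns_servers:
  - 1.1.1.1
  - 8.8.8.8
```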
Thank you for your reply. I'll share the information soon.
However, I'm not overriding any other parameters except the ones mentioned above.

If extra configuration needs to be applied to make it work with your changes, it should be documented.

Before these changes, or if I remove them manually, or even when I use the official manifests directly, everything works fine.
Nevertheless, I suggest that changes like this one be made directly in the upstream OpenStack Cloud Controller repository (link), as it is the primary source. The modifications made in the Kubespray repo for the OpenStack Cloud Controller Manager should track the official repository to maintain consistency.
Regards
OS
uname -srm
cat /etc/os-release
Version of Ansible
Version of Python
Python 3.10.13
Version of Kubespray (commit)
64447e7
Network plugin used
cilium
Full inventory with variables
Command used to invoke ansible
```
ansible-playbook cluster.yml --become \
  -i inventory/$K8S_CLUSTER_NAME/tf_state_kubespray.py \
  -e @inventory/$K8S_CLUSTER_NAME/$K8S_CLUSTER_NAME.yaml \
  -e @inventory/$K8S_CLUSTER_NAME/no_floating.yml \
  -e "ansible_ssh_private_key_file=/home/ansible/keys/generic_vm_id_rsa" \
  -e external_openstack_lbaas_floating_network_id=$KUBESPRAY_FLOATING_NETWORK_ID \
  -e external_openstack_lbaas_subnet_id=$KUBESPRAY_PRIVATE_SUBNET_ID
```
Output of ansible run