
OpenStack Cloud Provider Initialization Failure Due to DNSPolicy in DaemonSet Template #10914

Closed
kolovo opened this issue Feb 12, 2024 · 3 comments · Fixed by #11168
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


kolovo commented Feb 12, 2024

What happened?

When deploying Kubernetes using Kubespray with OpenStack as the external cloud provider, the cloud provider initialization fails with the following error:
W0212 09:05:21.997886 1 openstack.go:173] New openstack client created failed with config: Post "https://<redacted>:5000/v3/auth/tokens": dial tcp: lookup <redacted> on 10.233.0.3:53: write udp 10.233.0.3:48927->10.233.0.3:53: write: operation not permitted
F0212 09:05:21.998071 1 main.go:84] Cloud provider could not be initialized: could not init cloud provider "openstack": Post "https://<redacted>:5000/v3/auth/tokens": dial tcp: lookup <redacted> on 10.233.0.3:53: write udp 10.233.0.3:48927->10.233.0.3:53: write: operation not permitted

This issue appears to be related to DNS resolution failures when the OpenStack cloud provider attempts to authenticate with the OpenStack API. The problem is linked to the DNS policy configuration introduced in commit c440106 (link to the commit) in the external-openstack-cloud-controller-manager-ds.yml.j2 template (direct link to the affected line). The CoreDNS pod cannot start due to the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint, which in turn causes the OpenStack cloud controller to fail initialization.
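For context, the relevant part of the DaemonSet pod spec looks roughly like the following. This is a paraphrased sketch, not the verbatim template; the dnsPolicy value and image tag shown are assumptions based on the commit referenced above:

```yaml
# Sketch of the Kubespray-rendered DaemonSet pod spec (paraphrased).
# With hostNetwork: true, an explicit dnsPolicy of ClusterFirstWithHostNet
# routes lookups through cluster DNS (10.233.0.3 here) instead of the host's
# resolver -- a problem while CoreDNS itself cannot yet be scheduled.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: openstack-cloud-controller-manager
  namespace: kube-system
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet  # the setting this issue attributes the failure to
      containers:
        - name: openstack-cloud-controller-manager
          image: registry.k8s.io/provider-os/openstack-cloud-controller-manager:latest  # illustrative tag
```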

Notably, this DNS policy setting does not align with the default configurations provided by the official OpenStack cloud provider repository, both in the Helm chart (link to chart) and the plain manifests (link to plain manifest).

What did you expect to happen?

I expected the OpenStack cloud provider to initialize successfully without DNS resolution issues. The official configurations from the OpenStack cloud provider repository do not specify a dnsPolicy; because the pod runs with hostNetwork, the default ClusterFirst policy then falls back to the node's resolv.conf, which avoids such initialization problems.

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy a Kubernetes cluster using Kubespray with the OpenStack external cloud provider enabled.
  2. Observe that the OpenStack cloud controller manager fails to start, with logs indicating DNS resolution failures similar to the ones provided above.
  3. Note the presence of the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint, which prevents CoreDNS from starting and thereby blocks DNS resolution for the cloud controller.
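The chicken-and-egg situation in step 3 can be confirmed with standard kubectl commands against the affected cluster (illustrative only; the label selectors are assumptions based on the usual upstream labels, and pod/node names will differ per cluster):

```
# Show the uninitialized taint still present on the nodes
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# CoreDNS pods stay Pending because of that taint
kubectl -n kube-system get pods -l k8s-app=kube-dns

# The cloud controller crash-loops on the DNS failure shown above
kubectl -n kube-system logs -l k8s-app=openstack-cloud-controller-manager --tail=20
```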

OS

uname -srm

Linux 5.15.0-69-generic x86_64

cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version of Ansible

ansible [core 2.15.9]
config file = /home/ansible/kubespray_2240/ansible.cfg
configured module search path = ['/home/ansible/kubespray_2240/library']
ansible python module location = /home/ansible/python_venvs/kubespray_2231/lib/python3.10/site-packages/ansible
ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
executable location = /home/ansible/python_venvs/kubespray_2231/bin/ansible
python version = 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] (/home/ansible/python_venvs/kubespray_2231/bin/python3)
jinja version = 3.1.2
libyaml = True

Version of Python

Python 3.10.13

Version of Kubespray (commit)

64447e7

Network plugin used

cilium

Full inventory with variables

### addons.yml ###
metrics_server_enabled: true
metrics_server_replicas: 3
ingress_nginx_enabled: false

### etcd.yml ###
etcd_deployment_type: kubeadm

### k8s-cluster.yml ###
kube_version: v1.28.6
kube_network_plugin: cilium
enable_nodelocaldns: false
kubeconfig_localhost: true
supplementary_addresses_in_ssl_keys: ["redacted"]
# kube_proxy_remove: false

### all.yml ###
cloud_provider: external
external_cloud_provider: openstack

### openstack.yml ###
cinder_csi_enabled: true
cinder_topology: true
cinder_csi_ignore_volume_az: true

# kube_feature_gates:
# - CSIMigration=true
# - CSIMigrationOpenStack=true
# - ExpandCSIVolumes=true

external_openstack_lbaas_enabled: true
external_openstack_lbaas_floating_network_id: "88fbc66b-4946-469c-9848-8725d5014682"
#external_openstack_lbaas_floating_subnet_id: "Neutron subnet ID to get floating IP from"
external_openstack_lbaas_method: ROUND_ROBIN
external_openstack_lbaas_provider: amphora
external_openstack_lbaas_subnet_id: "6cf12127-41c1-4753-b61b-18a7d0098bf4"
#external_openstack_lbaas_network_id: "c896b852-21f4-472e-8dc4-fb3bf62b96bc"
external_openstack_lbaas_manage_security_groups: false
external_openstack_lbaas_create_monitor: true
external_openstack_lbaas_monitor_delay: '5s'
external_openstack_lbaas_monitor_max_retries: 1
external_openstack_lbaas_monitor_timeout: '3s'
external_openstack_lbaas_internal_lb: false

override_system_hostname: false

### k8s-net-cilium.yml ###
cilium_version: "v1.13.3"
cilium_cpu_limit: 1000m
cilium_memory_limit: 2000M
cilium_cpu_requests: 500m
cilium_memory_requests: 500M
cilium_enable_hubble: true
cilium_enable_hubble_metrics: true
cilium_hubble_metrics:
- dns
- drop
- tcp
- flow
- icmp
- http
cilium_hubble_install: true
cilium_hubble_tls_generate: true

The inventory was not modified; the default provided contrib/terraform/terraform.py was used.

Command used to invoke ansible

ansible-playbook cluster.yml --become -i inventory/$K8S_CLUSTER_NAME/tf_state_kubespray.py -e @inventory/$K8S_CLUSTER_NAME/$K8S_CLUSTER_NAME.yaml -e @inventory/$K8S_CLUSTER_NAME/no_floating.yml -e "ansible_ssh_private_key_file=/home/ansible/keys/generic_vm_id_rsa" -e external_openstack_lbaas_floating_network_id=$KUBESPRAY_FLOATING_NETWORK_ID -e external_openstack_lbaas_subnet_id=$KUBESPRAY_PRIVATE_SUBNET_ID

Output of ansible run

PLAY RECAP *************************************************************************************************************************
localhost : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
phnx-demo1-k8s-bastion-1 : ok=6 changed=1 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
phnx-demo1-k8s-k8s-master-1 : ok=736 changed=132 unreachable=0 failed=0 skipped=1135 rescued=0 ignored=2
phnx-demo1-k8s-k8s-master-2 : ok=668 changed=115 unreachable=0 failed=0 skipped=1039 rescued=0 ignored=2
phnx-demo1-k8s-k8s-master-3 : ok=670 changed=116 unreachable=0 failed=0 skipped=1037 rescued=0 ignored=2
phnx-demo1-k8s-k8s-node-worker-1 : ok=659 changed=103 unreachable=0 failed=0 skipped=787 rescued=0 ignored=1
phnx-demo1-k8s-k8s-node-worker-2 : ok=659 changed=103 unreachable=0 failed=0 skipped=777 rescued=0 ignored=1
phnx-demo1-k8s-k8s-node-worker-3 : ok=659 changed=103 unreachable=0 failed=0 skipped=777 rescued=0 ignored=1

Monday 12 February 2024 11:17:45 +0000 (0:00:01.315) 0:28:38.640 *******

kubernetes/control-plane : Joining control plane node to the cluster. ------------------------------------------------------ 77.33s
container-engine/containerd : Download_file | Download item ---------------------------------------------------------------- 24.70s
kubernetes/preinstall : Update package management cache (APT) -------------------------------------------------------------- 23.79s
download : Download_container | Download image if required ----------------------------------------------------------------- 23.19s
network_plugin/cilium : Cilium | Wait for pods to run ---------------------------------------------------------------------- 21.59s
container-engine/crictl : Download_file | Download item -------------------------------------------------------------------- 21.21s
container-engine/runc : Download_file | Download item ---------------------------------------------------------------------- 20.93s
container-engine/nerdctl : Download_file | Download item ------------------------------------------------------------------- 20.56s
kubernetes/kubeadm : Join to cluster --------------------------------------------------------------------------------------- 19.61s
container-engine/crictl : Extract_file | Unpacking archive ----------------------------------------------------------------- 19.15s
kubernetes/preinstall : Install packages requirements ---------------------------------------------------------------------- 18.59s
container-engine/nerdctl : Download_file | Validate mirrors ---------------------------------------------------------------- 16.20s
container-engine/nerdctl : Extract_file | Unpacking archive ---------------------------------------------------------------- 14.19s
kubernetes/control-plane : Kubeadm | Initialize first master --------------------------------------------------------------- 13.76s
container-engine/containerd : Download_file | Validate mirrors ------------------------------------------------------------- 13.29s
container-engine/crictl : Download_file | Validate mirrors ----------------------------------------------------------------- 12.36s
container-engine/runc : Download_file | Validate mirrors ------------------------------------------------------------------- 12.24s
etcdctl_etcdutl : Download_file | Download item ---------------------------------------------------------------------------- 11.08s
network_plugin/cilium : Cilium | Create Cilium node manifests -------------------------------------------------------------- 10.79s
download : Download_container | Download image if required ----------------------------------------------------------------- 10.77s

Anything else we need to know

Removing the dnsPolicy parameter from the external-openstack-cloud-controller-manager-ds.yml.j2 template allows the OpenStack cloud controller pod to resolve DNS queries using the host's DNS settings. This change resolves the issue and allows the OpenStack cloud controller manager to start without errors.

It may be beneficial to align Kubespray's configuration with the official OpenStack cloud provider templates by not specifying a dnsPolicy unless necessary, to prevent such issues in deployments.
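The workaround described above amounts to a one-line removal in the template, roughly as follows (illustrative diff; the surrounding lines are paraphrased, not quoted from the template). With the line gone, the hostNetwork pod's default ClusterFirst policy falls back to the host's resolv.conf:

```diff
 spec:
   template:
     spec:
       hostNetwork: true
-      dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: openstack-cloud-controller-manager
```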

@Payback159 (Contributor) commented:

Hello @kolovo ,

Can you also post the values for upstream_dns_servers and resolvconf_mode?

We have set resolvconf_mode: host_resolvconf in our cluster and configured additional upstream servers, and cluster provisioning works without any problems.
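For comparison, the setup described here corresponds to Kubespray group_vars along these lines (a sketch; the upstream DNS addresses are placeholders, not values from either cluster):

```yaml
# Illustrative Kubespray DNS settings (placeholder addresses).
resolvconf_mode: host_resolvconf  # manage DNS via the host's resolv.conf
upstream_dns_servers:
  - 1.1.1.1
  - 8.8.8.8
```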

Maybe I can recreate your setup and find the cause.

kolovo (Author) commented Feb 24, 2024

Hello @Payback159

Thank you for your reply. I'll share the information soon.
However, I'm not overriding any parameters other than those mentioned above.
If extra configuration is needed to make it work with your changes, it should be documented.
Before these changes, or if I remove them manually, or even when using the official manifests directly, everything works fine.
Nevertheless, I suggest that changes like this one be made directly in the upstream OpenStack Cloud Controller repository (link), as it's the primary source. The modifications made in the Kubespray repo for the OpenStack Cloud Controller Manager should align with the official repository to maintain consistency.
Regards

tico88612 (Member) commented May 7, 2024

I have the same problem and agree with @kolovo.
This should be consistent with the OpenStack Cloud Controller Manager official repository settings.

UPD: #kubespray-dev discussion
