
Controller-manager set to use cluster+host DNS results in non-functional cluster #1039

Closed
kh34 opened this issue Feb 27, 2019 · 8 comments


kh34 commented Feb 27, 2019

Running self-hosted Kubernetes on AWS EC2 with --cloud-provider=aws enabled.

The controller-manager never starts properly:

F0226 20:18:15.928417       1 controllermanager.go:201] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": error finding instance i-054e7d2613fd49b83: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout\""

I believe this is due to a cyclical dependency with CoreDNS:

$ kubectl get endpoints -n kube-system   coredns
NAME      ENDPOINTS   AGE
coredns   <none>      20m

The CoreDNS endpoints cannot be created until the controller-manager starts (its endpoints controller is what populates them), and the controller-manager cannot start without the CoreDNS endpoints, because its pod resolves DNS through the cluster resolver.
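A quick way to confirm the cycle (illustrative commands; they assume bootkube's default names, e.g. a kube-controller-manager deployment in kube-system):

# CoreDNS endpoints stay empty because the endpoints controller (part of
# the controller-manager) never comes up:
$ kubectl -n kube-system get endpoints coredns

# And the controller-manager keeps failing cloud-provider init because the
# AWS API hostname cannot be resolved through the dead cluster resolver:
$ kubectl -n kube-system logs deployment/kube-controller-manager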

I've tested reverting #629 and it does seem to resolve the issue. Please consider reverting that change or providing a workaround.

Environment (/etc/os-release):

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1967.6.0
VERSION_ID=1967.6.0
BUILD_ID=2019-02-12-2138
PRETTY_NAME="Container Linux by CoreOS 1967.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
@smekkley

CoreOS and Bootkube are practically deprecated. You should migrate to kubeadm; it's production-ready and supports multi-master as well.
As a workaround, did you already try changing the DNS configuration after igniting with CoreDNS?

@aaronlevy

I feel like that is a bit of an unfair characterization, but to be fair - I have been less active here (although I did finally get around to merging 1.13 manifests a couple days ago). However, I do largely rely on folks in the community to help on support (and thanks!).

Also, Bootkube is meant as a fairly simple bootstrapping tool - it is not meant to be a canonical source for production manifests. The manifests contained in this repo are for testing and demonstration purposes. For an example of a project that is more fully-featured / launches production clusters (and uses Bootkube) see https://typhoon.psdn.io

As far as Container Linux support, see https://coreos.com/blog/fedora-coreos-red-hat-coreos-and-future-container-linux for some more details.

Regarding your original question @kh34: it would seem odd that a non-cluster domain could not be resolved unless CoreDNS was running. It has been a while since I've looked, but I thought the search domains that end up in /etc/resolv.conf should only send in-cluster names to the cluster resolver?

If someone has a chance to dig further into this, I'm happy to take a look / merge a PR if that's the preferred direction (though I'm also not seeing this behavior on our CI clusters, which are launched in AWS).


dghubble commented Mar 7, 2019

Checking on this, bootkube examples generate controller-manager with dnsPolicy: ClusterFirstWithHostNet since #629, so the pod has an /etc/resolv.conf like:

nameserver 10.3.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local us-east-2.compute.internal
options ndots:5
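With that file, every lookup, external names included, is tried against the cluster resolver first. Purely as an illustration (assuming the 10.3.0.10 service VIP above has no CoreDNS endpoints behind it), a lookup of the AWS API host simply times out:

$ dig +time=2 +tries=1 ec2.us-east-1.amazonaws.com @10.3.0.10
;; connection timed out; no servers could be reached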

That's fine for default, cloud-agnostic clusters. Adding --cloud-provider=aws invokes AWS-specific code, which appears to require the AWS resolver, so I think your description is right. For that case, you'd probably prefer dnsPolicy: Default, which means the pod inherits the host's /etc/resolv.conf:

nameserver 10.0.0.2
search us-east-2.compute.internal
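A minimal sketch of that change in the rendered kube-controller-manager manifest (abbreviated; dnsPolicy is a standard PodSpec field, and the rest of the manifest is unchanged):

# kube-controller-manager deployment spec (abbreviated)
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: Default    # was: ClusterFirstWithHostNet (from #629)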

After that tweak, I should caution that --cloud-provider=aws is not a simple set-it-and-forget-it option. The in-tree AWS code makes specific assumptions about how hosts are configured with IAM roles and AWS tags, and other in-tree cloud providers have their own assumptions. I don't see bootkube aiming to satisfy all those host-provisioning concerns; more generally, projects like CCM (and CSI to a degree) aim to move functionality out of the in-tree cloud providers.


anguslees commented Mar 8, 2019

You should migrate to kubeadm

Aside: kubeadm has declared it is not interested in supporting a number of use cases (the linked example was running the control plane on ARM). Unless/until the kubeadm subproject becomes all-inclusive, I don't feel that "drop all other installers" is a reasonable proposition for the Kubernetes project as a whole.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 6, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
