conjure-up hangs 'Waiting for deployment to settle' on kubernetes #1377
Comments
I appear to get the same on Ubuntu 16.04. Though one time, after ~20 minutes or so, it erred out. When running via the interface I saw this problem: #1134. Giving up for now.
I subsequently discovered that if one has run 'snap install juju', it's broken, since it needs to use the juju included with conjure-up.
@donbowman I think the recent 2.3.6 snap juju had issues with downloading its agent. Normally, if snap juju is installed prior to conjure-up, conjure-up will use that instead. But, yes, juju and jujud are both packaged within conjure-up, so the bundled copy wouldn't have been affected by the recent 2.3.6 and simplestreams issue.
Yeah I used the included juju. I did:
Ubuntu 16.04
I am having the same issue as donbowman on the released version of Ubuntu 18.04. Has anyone figured out how to make conjure-up work on localhost? I even enabled sudo with no password, but it still gets stuck at "Waiting for deployment to settle."
I am facing the same issue. I have tried Ubuntu 18.04 as well as 16.04. The master node just sits there forever in 'Waiting for deployment to settle'.

```
2018-05-09 17:43:47,707 [DEBUG] conjure-up/canonical-kubernetes - events.py:53 - Received DeploymentComplete at conjureup/controllers/deploy/common.py:30 in task _wait_for_applications at conjureup/controllers/deploy/gui.py:82
```
What does `juju status` show?
```
Model  Controller  Cloud/Region  Version  SLA

App  Version  Status  Scale  Charm  Store  Rev  OS  Notes

Unit  Workload  Agent  Machine  Public address  Ports  Message

Machine  State  DNS  Inst id  Series  AZ  Message

Relation provider  Requirer  Interface  Type  Message
```
@Cynerva These issues seem to be around waiting for pods to start.
@sanjeevshar Can you run the cdk-field-agent script and attach the archive it creates? https://github.com/juju-solutions/cdk-field-agent
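For reference, running the field agent looks roughly like this; the script name is taken from the repo at the time and may have changed:

```
# grab the field agent and run its collection script against the active juju model;
# it writes a results tarball into the working directory
git clone https://github.com/juju-solutions/cdk-field-agent
cd cdk-field-agent
python3 collect-data.py
```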
@Cynerva I am having trouble running the Python script; it is constantly throwing an error. I looked for the file controllers.yaml. It is there, but only root has permission for it. Not sure how that happened.

```
test@dsib2041:~/.local/share/juju$ ls -l
```

Conjure-up kubernetes has failed, and the last error messages in the log file are:

```
2018-05-10 10:26:31,295 [INFO] conjure-up/canonical-kubernetes - common.py:36 - Waiting for deployment to settle.
ERROR:root:juju status --format=json failed: 1
```
Huh, that's weird. No idea what would cause the permissions on controllers.yaml to change like that. Can you try changing the owner back and running the cdk-field-agent script again?
If the permission changes back to root:root again, let us know and we can try to figure out how to reproduce it. Thanks.
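A minimal sketch of the ownership fix, assuming the whole ~/.local/share/juju tree should belong to your login user:

```
# give the juju client state back to the current user, then re-run the field agent
sudo chown -R "$USER":"$USER" ~/.local/share/juju
```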
@Cynerva The tar file is too big (33MB). I will clean out the system and try again so that the logs are not that big.
@Cynerva: Please find the tarball of logs attached.
Thanks @sanjeevshar. The relevant error shows up in the logs in your archive.

All of the pods are hitting this fatal error, which is why kubernetes-master is stuck on "Waiting for kube-system pods to start". The error originates from dockerd on the kubernetes-worker units trying to reach storage.googleapis.com. All three kubernetes-worker units are hitting this error, so it seems to be a problem with the environment.

@sanjeevshar Can you think of anything in your environment that could be intercepting docker's traffic to storage.googleapis.com? Can you try reaching it with curl, both on the host and on one of the kubernetes-worker units?
Thank you @Cynerva for taking the time to look into it. It could be the corporate firewall. Let me check with IT and get back to you. I have a couple of questions though:

1. What exactly is conjure-up waiting for while it reports "Waiting for deployment to settle"?
2. Is there a way to deploy in an environment that cannot reach storage.googleapis.com directly?

Thanks,
curl with the -k option seems to work, but not otherwise.
@sanjeevshar This seems to indicate an issue with your corporate firewall; the fact that -k (which skips certificate verification) works suggests the firewall is intercepting TLS. Can you verify with IT about getting the necessary access?
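A quick way to demonstrate that kind of interception; storage.googleapis.com is the endpoint from the earlier logs, and the exact failure message will vary with the proxy:

```
# with verification on, a re-signed certificate from an intercepting proxy fails
curl -sS -o /dev/null https://storage.googleapis.com && echo ok || echo certificate/connect failure

# -k disables certificate verification, so the same request goes through the proxy
curl -sSk -o /dev/null https://storage.googleapis.com && echo ok
```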
The conjure-up spell is waiting for all of the units to indicate that they're "ready" (workload status is "active" and agent status is "idle"). The kubernetes-master unit never enters an "active" state because the pods aren't coming up. I realize this isn't a very good signal to you, the user; ideally, something would tell you, directly, that the deployment has failed. But the variety of underlying issues we've seen is so large that we can't really account for all of them. Some of those issues are temporary and resolve themselves, while others are permanent and require user intervention. In short: kubernetes-master can't tell the difference between "the pods aren't up yet" and "the pods will never come up."
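A rough command-line equivalent of what the spell polls for, as a sketch assuming juju 2.x's JSON layout with per-unit `workload-status` and `juju-status` objects (this illustrates the idea; it is not conjure-up's actual code):

```
# print each unit's workload and agent status; conjure-up proceeds only once
# every unit shows workload "active" and agent "idle"
juju status --format=json \
  | jq -r '.applications[].units[]? | [."workload-status".current, ."juju-status".current] | @tsv'
```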
Yes, but it's not easy. You will need to host a docker registry, upload the images to it, and tweak several configurations to make Kubernetes use your registry instead of k8s.gcr.io. The best I can do is refer you to our "Running CDK in a restricted environment" document.
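A sketch of the registry side only; the image and tag here are examples, and the Kubernetes-side configuration changes are covered in the linked document:

```
# stand up a plain docker registry, then mirror an image from gcr.io into it
docker run -d -p 5000:5000 --restart=always --name registry registry:2
docker pull k8s.gcr.io/pause:3.1
docker tag k8s.gcr.io/pause:3.1 localhost:5000/pause:3.1
docker push localhost:5000/pause:3.1
```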
@sanjeevshar Thanks for the curl output.
Hi, I'm having the same problem.
@Cynerva I'm having the same issue, but it doesn't seem to be firewall related because the curl above worked. Master is 'waiting on pods to start'. Here is my `juju status`; any help to point me in the right direction would be appreciated.

```
Model  Controller  Cloud/Region  Version  SLA

App  Version  Status  Scale  Charm  Store  Rev  OS  Notes

Unit  Workload  Agent  Machine  Public address  Ports  Message

Machine  State  DNS  Inst id  Series  AZ  Message

Relation provider  Requirer  Interface  Type  Message
```
@vadoverde Calico/canal doesn't work with the localhost/LXD cloud because it requires privileged Docker containers, which do not work in LXD. I'm pretty sure that's the underlying cause of your issue; in particular, I expect the calico-node service is failing to start. You should be able to confirm it with the following command (assuming I didn't mistype something):
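The command itself didn't survive in this thread; a guess at the kind of check meant, assuming the workers run a systemd service named calico-node:

```
# on one worker, see whether the calico-node service is stuck or failing
juju run --unit kubernetes-worker/0 'systemctl status calico-node'
```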
Edit: Actually, I can confirm this from your attached debug log (thanks for that btw):
@battlemidget Is there a way for us to make it so calico/canal isn't selectable in the spell when deploying to localhost?
@luisenrike Most likely the error you're seeing indicates that kube-apiserver isn't up. You can check whether it's running on the master unit. Alternatively, feel free to run cdk-field-agent and attach the archive it creates, since that includes more comprehensive info that will help us get to the root of this more quickly.
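A sketch of that check, assuming CDK's snap-packaged services and the usual snap systemd unit naming (neither is confirmed in this thread):

```
# on the master, see whether the apiserver's snap service is running
juju run --unit kubernetes-master/0 'systemctl status snap.kube-apiserver.daemon'
```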
Yeah, I'll look into that.
@Cynerva: I finally got the firewall issue resolved, but kubernetes-master/0 is still in "Waiting for kube-system pods to start".

```
2018-06-23 23:07:16 INFO juju-log kubedns not ready yet
2018-06-23 23:07:16 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:08:01 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:09:29 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:12:12 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:17:14 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:22:17 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:27:19 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:32:22 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:37:26 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:42:30 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:47:32 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:52:38 DEBUG config-changed Traceback (most recent call last):
2018-06-23 23:57:49 DEBUG config-changed Traceback (most recent call last):
```
Juju status:

```
test@dsib0165:~$ juju status

App  Version  Status  Scale  Charm  Store  Rev  OS  Notes

Unit  Workload  Agent  Machine  Public address  Ports  Message

Machine  State  DNS  Inst id  Series  AZ  Message
```
@sanjeevshar It looks like you're hitting this issue: #1448. You can try manually applying the workaround in this comment: #1448 (comment).
Oh, sorry @sanjeevshar. I was going based off the symptoms. You can try running that workaround after deployment, once the units are up and you start seeing the errors. I'm guessing it won't help in this case, though, and conjure-up will probably call it a failed deployment and bail out by then anyway.

Anyway, that error in the debug-log indicates this is related to snaps (the package format we use for etcd and the kubernetes services), but we'll have to dig deeper to find the underlying cause. @sanjeevshar, can you think of anything noteworthy that's changed since your earlier deployments where you hit the firewall issue? A new operating system or anything like that? I ask because those earlier deployments looked to be running the snap services just fine.

If you can run cdk-field-agent again and attach the archive, that would help a lot; we would be looking for snap-related apparmor errors in there. It would also help if you can run `juju run --all sudo aa-status` after a failed deployment, before cleaning it up, and share the output.
@Cynerva My master/0 is still in the "Waiting for kube-system pods to start" state, but here is the output of juju run --all sudo aa-status:

```
test@dsib0165:~$ juju run --all sudo aa-status
```
cdk-field-agent output attached. Nothing has changed in my setup. I have even reinstalled Ubuntu 16.04, but there is no change in behavior.
Thanks @sanjeevshar. Sorry for the confusion; the debug-log and juju status that you pasted definitely showed a different error relating to snap profiles, which must have been a fluke. The newly attached archive lines up more with what you've been saying.

Anyway, I still see the same error you were hitting before, so it looks to me like traffic to storage.googleapis.com is still failing. Can you share the output of this command? I'd like to see what the verbose output says about the certificate:
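The requested command was most likely a verbose curl against the same endpoint; `-v` prints the TLS handshake, including who actually signed the server certificate:

```
# inspect the certificate chain presented to the unit
curl -v -o /dev/null https://storage.googleapis.com
```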
Thank you for your patience @Cynerva. Actually, IT had given me a couple of certificates to install on the host, which I did, and that is how the host is able to get to storage.googleapis.com:

```
sha256:da86e6ba6ca197bf6bc5e9d900febd 100%[=========================================================================>] 1.57K --.-KB/s in 0.01s
2018-06-26 09:31:55 (146 KB/s) - 'sha256:da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e' saved [1609/1609]
```

Unfortunately, when I run the same command on the worker or master nodes, it fails. Is there a workaround for this?

Now I have installed the certificates on master/0 and worker/0 and they are able to get to storage.googleapis.com, but how do I restart/continue the installation using the same containers?
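On the units, installing a corporate CA on Ubuntu is typically done like this; the file name is a placeholder for whatever IT provided, and it must be a PEM certificate with a .crt extension (dockerd only reads the trust store at startup):

```
# add the corporate CA to the system trust store and rebuild it
sudo cp corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
sudo update-ca-certificates

# restart dockerd so it picks up the new trust store
sudo systemctl restart docker
```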
@sanjeevshar If you wanted to clone our spells repo and add the certificate stuff in a before-wait hook, you could then deploy from that altered spell. As soon as I get some time, I'll provide you with an example spell showing what it would look like.
@battlemidget Whenever you get time, please post an example. It seems strange that a cluster deployment in a private cloud environment requires the nodes to reach storage.googleapis.com directly to fetch files. I think the nodes should get everything they need for installation from the host.
@sanjeevshar I believe we're working on an offline deployment that may solve this issue; @Cynerva, is this correct? @sanjeevshar Please see the following docs for summoning spells from a local directory: https://docs.conjure-up.io/devel/en/usage#github-and-bitbucket You'll want to git clone https://github.com/conjure-up/spells as well.
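Putting those two suggestions together, the flow would look roughly like this; whether conjure-up accepts a bare spell path plus a cloud argument exactly like this is an assumption, so check the linked usage docs:

```
# fetch the spells, add the certificate steps to the kubernetes spell, then summon it
git clone https://github.com/conjure-up/spells
# ... edit spells/canonical-kubernetes to add the certificate hook ...
conjure-up ./spells/canonical-kubernetes localhost
```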
The best we have is a documented path for installing CDK in a network-restricted environment, without conjure-up: https://github.com/juju-solutions/bundle-canonical-kubernetes/wiki/Running-CDK-in-a-restricted-environment Those instructions include setting up your own docker registry with the necessary images, and configuring Kubernetes to use it. That would take the direct calls to storage.googleapis.com out of the picture. So it's doable, but not straightforward. I don't think that's going to change any time soon.
I just had this same problem with identical symptoms. I provisioned a 4CPU/16GB AWS EC2 instance and ran conjure-up to install Kubernetes. The process hung with the master nodes waiting to start. The CloudWatch monitor showed the CPUs pegged at 85% the entire time, until I killed the lxc containers. I killed that instance, and started an 8CPU/32GB instance. As expected, the process went MUCH faster. The deployment was successful. CPU utilization peaked briefly at 76%, then fell to about 28% after the deployment completed. Last I checked, the CPUs are sitting at 8% utilization with the cluster running. Of course my experience is likely just one of many potential causes for the symptoms reported in this thread.
This problem seems to go away if you choose "dir" as the storage backend for LXD when you run lxd init, as per the documentation at conjureup.io.
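For a non-interactive setup, older LXD releases exposed the backend as a flag; newer ones use a preseed file instead, so treat this as illustrative:

```
# initialize LXD with the plain "dir" storage backend instead of lvm/zfs
lxd init --auto --storage-backend=dir
```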
This problem can be solved with lots of CPU cores and memory. I started with 4 CPUs/4GB; increasing to 32 CPUs/16GB solved it and made the installation very fast.
Report
Ubuntu 18.04, using localhost/lxd. juju appears to work (e.g. I can set up 'ghost' and access it), but conjure-up just hangs as below (logs et al. attached).
lxc is configured using local LVM as storage and a local bridge for networking.
It doesn't put anything interesting in the logs. It doesn't seem to even start a single container.
(The controller pre-exists; I bootstrapped it with juju, and it works for e.g. setting up ghost.)
```
sudo snap refresh conjure-up --edge
```

Please provide the output of the following commands:
Please attach tarball of ~/.cache/conjure-up:
Sosreport
Please attach a sosreport:
The resulting output file can be attached to this issue.
What Spell was Selected?
kubernetes-core
What provider (aws, maas, localhost, etc)?
localhost
MAAS Users
Which version of MAAS?
Commands Run
Please outline what commands were run to install and execute conjure-up:
conjure-up.log
sosreport-DBowman-20180322111548.tar.gz
conjure-up is from the snap.
Additional Information
Deploying other charms (e.g. ghost) works, so it's presumably related to kubernetes-core.