This repository has been archived by the owner on Apr 12, 2021. It is now read-only.

Waiting for kube-system pods to start #1430

Open
Ariestattoo opened this issue May 13, 2018 · 15 comments

@Ariestattoo commented May 13, 2018

Report

[screenshot from 2018-05-13 14-43-14]

I realize that this is a non-specific error, but I am not sure where I can investigate further at this point. Any suggestions or insight are very much appreciated.

Thank you for trying conjure-up! Before reporting a bug please make sure you've gone through this checklist:

Please provide the output of the following commands

which juju  /snap/bin/juju
juju version 2.3.7-bionic-amd64

which conjure-up  /snap/bin/conjure-up
conjure-up --version  conjure-up 2.5.6

which lxc  /snap/bin/lxc

/snap/bin/lxc config show
config:
  core.https_address: '[::]:8443'
  core.trust_password: true

/snap/bin/lxc version
Client version: 3.0.0
Server version: 3.0.0

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04 LTS"

Please attach tarball of ~/.cache/conjure-up:
conjure-up.tar.gz

Sosreport

200 MB as a zip file, 30 MB as xz

What Spell was Selected?

kubernetes-canonical

What provider (aws, maas, localhost, etc)?

localhost

MAAS Users

Which version of MAAS?

Commands ran

Please outline what commands were run to install and execute conjure-up:
conjure-up

Additional Information

cdk-field-agent

@Cynerva commented May 14, 2018

Thanks for the cdk-field-agent attachment.

kubectl describe po indicates that pods can't deploy because there are no available nodes:

Warning  FailedScheduling  2m (x20 over 8m)    default-scheduler  0/3 nodes are available: 3 node(s) were not ready.

kubectl describe nodes indicates that kubelet is restarting repeatedly:

...
  Normal  Starting                 10s   kubelet, juju-74c113-8     Starting kubelet.
  Normal  NodeHasSufficientDisk    9s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  9s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientMemory
  Normal  Starting                 6s    kubelet, juju-74c113-8     Starting kubelet.
  Normal  NodeHasSufficientDisk    6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientPID     6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientPID
  Normal  NodeHasNoDiskPressure    6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientMemory  6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientMemory
  Normal  NodeHasSufficientPID     3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientPID
  Normal  Starting                 3s    kubelet, juju-74c113-8     Starting kubelet.
  Normal  NodeHasSufficientDisk    3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasNoDiskPressure
...

journalctl -o cat -u snap.kubelet.daemon shows this fatal error occurring repeatedly:

kubelet.daemon[5557]: F0513 21:08:01.329603    5557 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: cannot find filesystem info for device "default/containers/juju-74c113-6"
systemd[1]: snap.kubelet.daemon.service: Main process exited, code=exited, status=255/n/a

@Ariestattoo I believe this is an issue we've seen before when installing to localhost/LXD with a ZFS storage backend. You might be able to work around this by running lxd init and, when prompted to select a storage backend, typing "dir" instead of letting it default to "zfs".
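
Roughly, the dir-backed setup would look something like the sketch below (illustrative only; the pool name "dirpool" is an example, and containers already on the old pool are not migrated automatically):

  # Re-run the LXD init wizard and answer "dir" when asked for a storage backend:
  sudo lxd init

  # Or, skipping the wizard: create a dir-backed pool and point the default profile at it.
  lxc storage create dirpool dir
  lxc profile device set default root pool dirpool

  # Confirm which driver each pool is using:
  lxc storage list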

I will follow up on two points here:

  1. Why does kubernetes-worker status not show kubelet is failing?
  2. Why, exactly, does kubelet fail on ZFS-backed LXD? Can we make it work?

@Ariestattoo (Author) commented May 14, 2018

@Cynerva Thanks for the great insight!!
I am running my ZFS pool on a dedicated block device, but I am going to create another one and try btrfs, and then dir if that doesn't work. I will detail my results.
TY

@Ariestattoo (Author)

So I tried creating new pools using btrfs and dir... both failed.
Results:
BTRFS CDK
DIR CDK

@Cynerva commented May 15, 2018

Thanks. Taking a quick glance in the DIR archive, it's hitting the same error:

kubelet.daemon[4731]: F0515 15:51:14.982974    4731 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: cannot find filesystem info for device "default/containers/juju-bb2fd1-1"

I'm guessing either LXD didn't actually stop using ZFS, or I misdiagnosed the issue and it's not ZFS related after all. The default/containers/juju-bb2fd1-1 device name looks like ZFS to me though.
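
A quick way to double-check which pool the Juju machine containers actually landed on would be something like this (the container name here is the one from the log above; substitute your own):

  # List the storage pools and their drivers (zfs, btrfs, dir, ...):
  lxc storage list

  # Show the expanded config of one of the machine containers; the root
  # device's "pool" key shows which pool it is really using:
  lxc config show juju-bb2fd1-1 --expanded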

I don't have anything helpful to offer right now, and lots of work to juggle so it'll be a few days before I can come back to this. Thanks again for the detailed report and sorry for the trouble.

@Ariestattoo (Author)

I appreciate the time constraints... I'm doing this on my lunch break. A couple of questions, if I might:

  1. I am going to narrow the scope of my build and work with an individual charm. Can you recommend a single charm that I could use to try and debug the deployment?
  2. If you had to guess, is this a Linux | filesystem | LXD | Juju | Kubernetes issue?

Thanks for your time @Cynerva

@Cynerva commented May 15, 2018

Can you recommend a single charm that I could use to try and debug the deployment?

Afraid not. The fatal error is coming from the kubelet service on the kubernetes-worker units, but you're gonna need the rest of the cluster (easyrsa, etcd, kubernetes-master, flannel) for kubernetes-worker to get far enough to start kubelet.
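
If you want to watch kubelet directly while the spell deploys, something along these lines should work (the unit number is illustrative):

  # Overall deployment state:
  juju status

  # Follow the kubelet log on the first worker unit:
  juju ssh kubernetes-worker/0 -- journalctl -o cat -u snap.kubelet.daemon -f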

If you had to guess, is this a Linux | filesystem | LXD | Juju | Kubernetes issue?

Either there's a bug in kubelet (one of the Kubernetes core services), or kubelet is missing a dependency that it needs. I'm guessing the latter, which would make it a bug in the kubernetes-worker charm.

@sumlin commented May 16, 2018

@Ariestattoo Local deployment doesn't work right now, see #1426. As a result, I'd suggest setting up your cluster on Ubuntu 16.04 manually.

@adam-stokes (Contributor)

It does work; you're using btrfs in your linked bug. The problem here seems to be that ZFS is still being used.

@Ariestattoo (Author) commented May 16, 2018

So I used lxc to create these pools and then selected the relevant one when using conjure-up. Is that incorrect? My default pool is an XFS pool backed by a block device, and I have several existing controllers and models already created and in use on it. Where do you see my configuration error? Are you suggesting I run lxd init again instead of using lxc to manually create and select a pool?

@sumlin commented May 16, 2018

@battlemidget I've tried both btrfs and ZFS; this information is in the ticket.

@adam-stokes (Contributor)

@sumlin Yea, what I'm saying is don't use those for now (at least until we can figure out why those are giving us trouble) and stick with dir as your storage backend for LXD.

@sumlin commented May 22, 2018

@battlemidget oh, thank you, I will.

@jzoldak commented May 24, 2018

FYI @battlemidget @sumlin @Ariestattoo I was running into the same issue, and after running sudo lxd init and changing from zfs to dir it did get past this point and finished the conjure-up of the kubernetes-canonical spell.

@adam-stokes (Contributor) commented May 24, 2018

FYI @battlemidget @sumlin @Ariestattoo I was running into the same issue, and after running sudo lxd init and changing from zfs to dir it did get past this point and finished the conjure-up of the kubernetes-canonical spell.

Thanks for the feedback. The Kubernetes folks know there is something going wrong when using a storage backend other than dir and are working to track down the root cause.

@Cynerva btrfs tends to be the default if you don't have the zfs utils package installed. I think we should talk to the LXD guys as well to see if btrfs is the right choice as a default in these cases.
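
For reference, a quick way to check what you ended up with (assuming the Ubuntu package name zfsutils-linux):

  # Is the ZFS userspace tooling installed on the host?
  dpkg -s zfsutils-linux

  # Which driver did the existing LXD pools get?
  lxc storage list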

@countbytedown commented Apr 2, 2019

This resolved it for me.

Feel free to close
