This repository has been archived by the owner on Apr 12, 2021. It is now read-only.

Waiting for kube-system pods to start #1430

Open
Ariestattoo opened this issue May 13, 2018 · 15 comments

@Ariestattoo commented May 13, 2018

Report

[screenshot from 2018-05-13 14-43-14]

I realize that this is a non-specific error, but I am not sure where I can investigate further at this point. Any suggestions or insight are very much appreciated.

Thank you for trying conjure-up! Before reporting a bug please make sure you've gone through this checklist:

Please provide the output of the following commands

which juju  /snap/bin/juju
juju version 2.3.7-bionic-amd64

which conjure-up  /snap/bin/conjure-up
conjure-up --version  conjure-up 2.5.6

which lxc  /snap/bin/lxc

/snap/bin/lxc config show
config:
  core.https_address: '[::]:8443'
  core.trust_password: true

/snap/bin/lxc version
Client version: 3.0.0
Server version: 3.0.0

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04 LTS"

Please attach tarball of ~/.cache/conjure-up:
conjure-up.tar.gz

Sosreport

200 MB as a zip file, 30 MB as xz

What Spell was Selected?

kubernetes-canonical

What provider (aws, maas, localhost, etc)?

localhost

MAAS Users

Which version of MAAS?

Commands ran

Please outline what commands were run to install and execute conjure-up:
conjure-up

Additional Information

cdk-field-agent

@Cynerva commented May 14, 2018

Thanks for the cdk-field-agent attachment.

kubectl describe po indicates that pods can't deploy because there are no available nodes:

Warning  FailedScheduling  2m (x20 over 8m)    default-scheduler  0/3 nodes are available: 3 node(s) were not ready.

kubectl describe nodes indicates that kubelet is restarting repeatedly:

...
  Normal  Starting                 10s   kubelet, juju-74c113-8     Starting kubelet.
  Normal  NodeHasSufficientDisk    9s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  9s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientMemory
  Normal  Starting                 6s    kubelet, juju-74c113-8     Starting kubelet.
  Normal  NodeHasSufficientDisk    6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientPID     6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientPID
  Normal  NodeHasNoDiskPressure    6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientMemory  6s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientMemory
  Normal  NodeHasSufficientPID     3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientPID
  Normal  Starting                 3s    kubelet, juju-74c113-8     Starting kubelet.
  Normal  NodeHasSufficientDisk    3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    3s    kubelet, juju-74c113-8     Node juju-74c113-8 status is now: NodeHasNoDiskPressure
...

journalctl -o cat -u snap.kubelet.daemon shows this fatal error occurring repeatedly:

kubelet.daemon[5557]: F0513 21:08:01.329603    5557 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: cannot find filesystem info for device "default/containers/juju-74c113-6"
systemd[1]: snap.kubelet.daemon.service: Main process exited, code=exited, status=255/n/a

@Ariestattoo I believe this is an issue we've seen before when installing to localhost/LXD with a ZFS storage backend. You might be able to work around this by running lxd init and, when prompted to select a storage backend, typing "dir" instead of letting it default to "zfs".
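
Roughly, the dir-backed setup would look something like the sketch below (illustrative only; the pool name "dirpool" is an example, and containers already on the old pool are not migrated automatically):

  # Re-run the LXD init wizard and answer "dir" when asked for a storage backend:
  sudo lxd init

  # Or, skipping the wizard: create a dir-backed pool and point the default profile at it.
  lxc storage create dirpool dir
  lxc profile device set default root pool dirpool

  # Confirm which driver each pool is using:
  lxc storage list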

I will follow up on two points here:

  1. Why does kubernetes-worker status not show kubelet is failing?
  2. Why, exactly, does kubelet fail on ZFS-backed LXD? Can we make it work?

@Ariestattoo (Author) commented May 14, 2018

@Cynerva Thanks for the great insight!!
I am running my ZFS pool on a dedicated block device, but I am going to create another one and try btrfs, and then dir if that doesn't work. I will detail my results.
TY

@Ariestattoo (Author)

So I tried creating new pools using btrfs and dir... both failed.
Results:
BTRFS CDK
DIR CDK

@Cynerva commented May 15, 2018

Thanks. Taking a quick glance in the DIR archive, it's hitting the same error:

kubelet.daemon[4731]: F0515 15:51:14.982974    4731 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: cannot find filesystem info for device "default/containers/juju-bb2fd1-1"

I'm guessing either LXD didn't actually stop using ZFS, or I misdiagnosed the issue and it's not ZFS related after all. The default/containers/juju-bb2fd1-1 device name looks like ZFS to me though.
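
A quick way to double-check which pool the Juju machine containers actually landed on would be something like this (the container name here is the one from the log above; substitute your own):

  # List the storage pools and their drivers (zfs, btrfs, dir, ...):
  lxc storage list

  # Show the expanded config of one of the machine containers; the root
  # device's "pool" key shows which pool it is really using:
  lxc config show juju-bb2fd1-1 --expanded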

I don't have anything helpful to offer right now, and lots of work to juggle so it'll be a few days before I can come back to this. Thanks again for the detailed report and sorry for the trouble.

@Ariestattoo (Author)

I appreciate the time constraints... I'm doing this on my lunch break. A couple of questions, if I might:

  1. I am going to narrow the scope of my build and work with an individual charm. Can you recommend a single charm that I could use to try and debug the deployment?
  2. If you had to guess, is this a Linux | filesystem | LXD | Juju | Kubernetes issue?

Thanks for your time @Cynerva

@Cynerva commented May 15, 2018

Can you recommend a single charm that I could use to try and debug the deployment?

Afraid not. The fatal error is coming from the kubelet service on the kubernetes-worker units, but you're gonna need the rest of the cluster (easyrsa, etcd, kubernetes-master, flannel) for kubernetes-worker to get far enough to start kubelet.
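
If you want to watch kubelet directly while the spell deploys, something along these lines should work (the unit number is illustrative):

  # Overall deployment state:
  juju status

  # Follow the kubelet log on the first worker unit:
  juju ssh kubernetes-worker/0 -- journalctl -o cat -u snap.kubelet.daemon -f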

If you had to guess, is this a Linux | filesystem | LXD | Juju | Kubernetes issue?

Either there's a bug in kubelet (one of the Kubernetes core services), or kubelet is missing a dependency that it needs. I'm guessing the latter, which would make it a bug in the kubernetes-worker charm.

@sumlin commented May 16, 2018

@Ariestattoo Local deployment doesn't work right now, see #1426. As a result, I'd suggest setting up your cluster on Ubuntu 16.04 manually.

@adam-stokes (Contributor)

It does work; you're using btrfs in your linked bug. The problem here seems to be that ZFS is still being used.

@Ariestattoo (Author) commented May 16, 2018

So I used lxc to create these pools and then selected the relevant one when using conjure-up. Is that incorrect? My default pool is an XFS pool backed by a block device, and I have several existing controllers and models already created and in use on it. Where do you see my configuration error? Are you suggesting I run lxd init again instead of using lxc to manually create and select a pool?

@sumlin commented May 16, 2018

@battlemidget I've tried both btrfs and ZFS; this information is in the ticket.

@adam-stokes (Contributor)

@sumlin Yea, what I'm saying is don't use those for now (at least until we can figure out why those are giving us trouble) and stick with dir as your storage backend for LXD.

@sumlin commented May 22, 2018

@battlemidget oh, thank you, I will.

@jzoldak commented May 24, 2018

FYI @battlemidget @sumlin @Ariestattoo I was running into the same issue, and after running sudo lxd init and changing from zfs to dir it did get past this point and finished the conjure-up of the kubernetes-canonical spell.

@adam-stokes (Contributor) commented May 24, 2018

FYI @battlemidget @sumlin @Ariestattoo I was running into the same issue, and after running sudo lxd init and changing from zfs to dir it did get past this point and finished the conjure-up of the kubernetes-canonical spell.

Thanks for the feedback. The Kubernetes folks know there is something going wrong when using a storage backend other than dir and are working to track down the root cause.

@Cynerva btrfs tends to be the default if you don't have the zfs utils package installed. I think we should talk to the LXD guys as well to see if btrfs is the right choice as a default in these cases.
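
For reference, a quick way to check what you ended up with (assuming the Ubuntu package name zfsutils-linux):

  # Is the ZFS userspace tooling installed on the host?
  dpkg -s zfsutils-linux

  # Which driver did the existing LXD pools get?
  lxc storage list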

@countbytedown commented Apr 2, 2019

This resolved it for me.

Feel free to close
