This repository has been archived by the owner on Apr 12, 2021. It is now read-only.

Kubelet fails to start on worker node - keeps restarting #1421

Open

akodd opened this issue Apr 29, 2018 · 3 comments

@akodd

akodd commented Apr 29, 2018

Report

Kubelet fails to start on the worker; the worker keeps restarting it. The master node waits indefinitely for pods to start.

See attached field agent reports.

root@juju-4222ff-4:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
juju-4222ff-4 NotReady 18m v1.10.0
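
For anyone reproducing this, one way to watch the restart loop on the worker itself is via the kubelet snap service (the service name matches the systemd log further down; the unit name kubernetes-worker/0 is just an example and may differ in your juju status output):

juju ssh kubernetes-worker/0
sudo systemctl status snap.kubelet.daemon.service
sudo journalctl -u snap.kubelet.daemon.service -f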

Thank you for trying conjure-up! Before reporting a bug, please make sure you've gone through this checklist:

Please provide the output of the following commands

Tried on the current channel; reporting on the edge version.

which juju
/snap/bin/juju
juju version
2.3.7-xenial-amd64

which conjure-up
/snap/bin/conjure-up
conjure-up --version
conjure-up 2.5.6

which lxc
/snap/bin/lxc

/snap/bin/lxc config show
core.https_address: '[::]'
/snap/bin/lxc version
Client version: 3.0.0
Server version: 3.0.0

cat /etc/lsb-release

Please attach a tarball of ~/.cache/conjure-up:

tar cvzf conjure-up.tar.gz ~/.cache/conjure-up

conjure-up.tar.gz

Sosreport

Please attach a sosreport:

sudo apt install sosreport
sosreport

The resulting output file can be attached to this issue.

What Spell was Selected?

canonical-kubernetes

What provider (aws, maas, localhost, etc)?

localhost

MAAS Users

Which version of MAAS?
not MAAS

Commands run

Please outline what commands were run to install and execute conjure-up:

conjure-up

(Screenshots attached: screen shot 2018-04-29 at 00 13 41 and screen shot 2018-04-29 at 00 15 25.)

Additional Information

Output from cdk-field-agent

results-2018-04-28-23-20-23.tar.gz

results-2018-04-29-00-15-23.tar.gz

@stale

stale bot commented Aug 27, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Aug 27, 2018
@johnsca
Contributor

johnsca commented Aug 27, 2018

I'm not sure what's going on here, but the relevant bit from the cdk-field-agent report seems to be this:

Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160018   32157 kubelet.go:1777] Starting kubelet main sync loop.
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: E0429 04:15:01.160066   32157 kubelet.go:1277] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160048   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160198   32157 desired_state_of_world_populator.go:129] Desired state populator starts to run
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160206   32157 server.go:944] Started kubelet
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160198   32157 server.go:299] Adding debug handlers to kubelet server.
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.260129   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.260230   32157 kubelet_node_status.go:271] Setting node annotation to enable volume controller attach/detach
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.261806   32157 kubelet_node_status.go:82] Attempting to register node juju-4222ff-4
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.271429   32157 kubelet_node_status.go:127] Node juju-4222ff-4 was previously registered
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.271708   32157 kubelet_node_status.go:85] Successfully registered node juju-4222ff-4
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.460253   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.860358   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:02 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:02.660481   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:03 juju-4222ff-4 kubelet.daemon[32157]: W0429 04:15:03.174732   32157 manager.go:340] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
Apr 29 04:15:04 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:04.260603   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:06.796736   32157 cpu_manager.go:155] [cpumanager] starting with none policy
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:06.796755   32157 cpu_manager.go:156] [cpumanager] reconciling every 10s
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:06.796764   32157 policy_none.go:42] [cpumanager] none policy: Start
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: W0429 04:15:06.796784   32157 fs.go:539] stat failed on /dev/loop3 with error: no such file or directory
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: F0429 04:15:06.796797   32157 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: failed to get device for dir "/var/lib/kubelet": could not find device with major: 0, minor: 106 in cached partitions map
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Main process exited, code=exited, status=255/n/a
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Unit entered failed state.
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Failed with result 'exit-code'.
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Service hold-off time over, scheduling restart.
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: Stopped Service for snap application kubelet.daemon.
Apr 29 04:15:07 juju-4222ff-4 systemd[1]: Started Service for snap application kubelet.daemon.

@Cynerva Any ideas here? There's a lot more data in the cdk-field-agent reports.

stale bot removed the wontfix label Aug 27, 2018
@Cynerva

Cynerva commented Aug 27, 2018

This happens when LXD is configured to use the zfs or btrfs storage backend. You'll need to configure LXD to use the dir storage backend instead, then start a new deployment of canonical-kubernetes.
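
As a rough sketch, switching new containers to a dir-backed pool looks something like the following (the pool name conjure-dir is a placeholder and the default profile is assumed; existing containers stay on their old pool, hence the need for a fresh deployment):

lxc storage create conjure-dir dir
lxc profile device remove default root
lxc profile device add default root disk path=/ pool=conjure-dir

A fresh canonical-kubernetes deployment started after this should then end up on the dir pool.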
