This repository has been archived by the owner on Apr 12, 2021. It is now read-only.

Kubelet fails to start on worker node - keeps restarting #1421

Open

akodd opened this issue Apr 29, 2018 · 3 comments

@akodd

akodd commented Apr 29, 2018

Report

Kubelet fails to start on the worker; the worker keeps restarting it. The master node waits indefinitely for pods to start.

See attached field agent reports.

root@juju-4222ff-4:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
juju-4222ff-4 NotReady 18m v1.10.0
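
For anyone reproducing this, one way to watch the restart loop on the worker itself is via the kubelet snap service (the service name matches the systemd log further down; the unit name kubernetes-worker/0 is just an example and may differ in your juju status output):

juju ssh kubernetes-worker/0
sudo systemctl status snap.kubelet.daemon.service
sudo journalctl -u snap.kubelet.daemon.service -f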

Thank you for trying conjure-up! Before reporting a bug, please make sure you've gone through this checklist:

Please provide the output of the following commands

Tried on the current channel; reporting on the edge version.

which juju
/snap/bin/juju
juju version
2.3.7-xenial-amd64

which conjure-up
/snap/bin/conjure-up
conjure-up --version
conjure-up 2.5.6

which lxc
/snap/bin/lxc

/snap/bin/lxc config show
core.https_address: '[::]'
/snap/bin/lxc version
Client version: 3.0.0
Server version: 3.0.0

cat /etc/lsb-release

Please attach a tarball of ~/.cache/conjure-up:

tar cvzf conjure-up.tar.gz ~/.cache/conjure-up

conjure-up.tar.gz

Sosreport

Please attach a sosreport:

sudo apt install sosreport
sosreport

The resulting output file can be attached to this issue.

What Spell was Selected?

canonical-kubernetes

What provider (aws, maas, localhost, etc)?

localhost

MAAS Users

Which version of MAAS?
not MAAS

Commands run

Please outline what commands were run to install and execute conjure-up:

conjure-up

(Screenshots attached: screen shot 2018-04-29 at 00 13 41 and screen shot 2018-04-29 at 00 15 25.)

Additional Information

Output from cdk-field-agent

results-2018-04-28-23-20-23.tar.gz

results-2018-04-29-00-15-23.tar.gz

@stale

stale bot commented Aug 27, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Aug 27, 2018
@johnsca
Contributor

johnsca commented Aug 27, 2018

I'm not sure what's going on here, but the relevant bit from the cdk-field-agent report seems to be this:

Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160018   32157 kubelet.go:1777] Starting kubelet main sync loop.
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: E0429 04:15:01.160066   32157 kubelet.go:1277] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160048   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160198   32157 desired_state_of_world_populator.go:129] Desired state populator starts to run
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160206   32157 server.go:944] Started kubelet
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.160198   32157 server.go:299] Adding debug handlers to kubelet server.
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.260129   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.260230   32157 kubelet_node_status.go:271] Setting node annotation to enable volume controller attach/detach
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.261806   32157 kubelet_node_status.go:82] Attempting to register node juju-4222ff-4
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.271429   32157 kubelet_node_status.go:127] Node juju-4222ff-4 was previously registered
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.271708   32157 kubelet_node_status.go:85] Successfully registered node juju-4222ff-4
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.460253   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:01 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:01.860358   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:02 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:02.660481   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:03 juju-4222ff-4 kubelet.daemon[32157]: W0429 04:15:03.174732   32157 manager.go:340] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
Apr 29 04:15:04 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:04.260603   32157 kubelet.go:1794] skipping pod synchronization - [container runtime is down]
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:06.796736   32157 cpu_manager.go:155] [cpumanager] starting with none policy
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:06.796755   32157 cpu_manager.go:156] [cpumanager] reconciling every 10s
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: I0429 04:15:06.796764   32157 policy_none.go:42] [cpumanager] none policy: Start
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: W0429 04:15:06.796784   32157 fs.go:539] stat failed on /dev/loop3 with error: no such file or directory
Apr 29 04:15:06 juju-4222ff-4 kubelet.daemon[32157]: F0429 04:15:06.796797   32157 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: failed to get device for dir "/var/lib/kubelet": could not find device with major: 0, minor: 106 in cached partitions map
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Main process exited, code=exited, status=255/n/a
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Unit entered failed state.
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Failed with result 'exit-code'.
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: snap.kubelet.daemon.service: Service hold-off time over, scheduling restart.
Apr 29 04:15:06 juju-4222ff-4 systemd[1]: Stopped Service for snap application kubelet.daemon.
Apr 29 04:15:07 juju-4222ff-4 systemd[1]: Started Service for snap application kubelet.daemon.

@Cynerva Any ideas here? There's a lot more data in the cdk-field-agent reports.

stale bot removed the wontfix label Aug 27, 2018
@Cynerva

Cynerva commented Aug 27, 2018

This happens when LXD is configured to use the zfs or btrfs storage backend. You'll need to configure LXD to use the dir storage backend instead, then start a new deployment of canonical-kubernetes.
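
As a rough sketch, switching new containers to a dir-backed pool looks something like the following (the pool name conjure-dir is a placeholder and the default profile is assumed; existing containers stay on their old pool, hence the need for a fresh deployment):

lxc storage create conjure-dir dir
lxc profile device remove default root
lxc profile device add default root disk path=/ pool=conjure-dir

A fresh canonical-kubernetes deployment started after this should then end up on the dir pool.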
