GPU crashing on 1 node. #1628

Open
ryanm101 opened this issue Dec 16, 2023 · 8 comments

Comments

@ryanm101

NAME   STATUS   ROLES                                       AGE     VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION          CONTAINER-RUNTIME
nuc1   Ready    control-plane,etcd,master,worker            2y40d   v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 38 (Server Edition)   6.5.6-200.fc38.x86_64   containerd://1.7.6-k3s1.26
nuc2   Ready    control-plane,coral.ai,etcd,master,worker   127m    v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 39 (Server Edition)   6.6.2-201.fc39.x86_64   containerd://1.7.6-k3s1.26
nuc3   Ready    control-plane,etcd,master,worker            42d     v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 38 (Server Edition)   6.5.8-200.fc38.x86_64   containerd://1.7.6-k3s1.26

I'm running three master nodes on k3s.
NUC1 and NUC3 both deploy fine.
On NUC2 the container crashes with:

E1216 11:45:32.208374       1 manager.go:146] Failed to serve gpu.intel.com/i915: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"
Cannot register to kubelet service
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).registerWithKubelet
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:352
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).setupAndServe
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:280
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).Serve
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:207
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).handleUpdate.func1
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:144
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598

Command used to provision NUC2:

curl -sfL https://get.k3s.io | K3S_URL=https://cluster.domain:6443 K3S_TOKEN=1:server:1 INSTALL_K3S_VERSION=v1.26.9+k3s1 sh -s - server --flannel-backend=none --disable-network-policy --cluster-cidr=x.x.x.x/x --service-cidr=x.x.x.x/x --cluster-init --disable=servicelb --disable traefik --selinux

The only differences between NUC2 and NUC1/3 are:

  1. NUC2 is FC39 and the others are FC38.
  2. When starting k3s on NUC2, it complained about SELinux and said to add '--selinux' to the startup command (the other two nodes don't have this).

Any advice appreciated.
I will test re-adding the node without --selinux and, if all else fails, change it to FC38.

@tkatila
Contributor

tkatila commented Dec 18, 2023

Hi @ryanm101

I found a somewhat similar error here: intel/intel-technology-enabling-for-openshift#113. That issue lists a couple of workarounds that might work. Could you try them out?

@tkatila
Contributor

tkatila commented Dec 18, 2023

I reproduced the issue on a VM. The device plugin seems to work without SELinux but fails with it. The SELinux audit log has this entry:

type=AVC msg=audit(1702889339.432:3913): avc:  denied  { connectto } for  pid=16332 comm="intel_gpu_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_device_plugin_t:s0:c620,c968 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0
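
To make the denial easier to read, here is a minimal sketch that turns one AVC line into the allow rule it implies. The function name `avc_to_allow` is hypothetical, and the parsing only handles the single-line format shown above; it is not a substitute for audit2allow.

```shell
# Print the allow rule implied by one AVC denial line (hypothetical helper).
# Assumes the "denied { perms } ... scontext=u:r:type:... tcontext=u:r:type:...
# tclass=class" layout seen in the audit entry above.
avc_to_allow() {
    local line="$1" perms stype ttype tclass
    perms="$(printf '%s\n' "$line" | sed -nE 's/.*denied[[:space:]]+\{ ([^}]*) \}.*/\1/p')"
    stype="$(printf '%s\n' "$line" | sed -nE 's/.* scontext=[^:]+:[^:]+:([^: ]+).*/\1/p')"
    ttype="$(printf '%s\n' "$line" | sed -nE 's/.* tcontext=[^:]+:[^:]+:([^: ]+).*/\1/p')"
    tclass="$(printf '%s\n' "$line" | sed -nE 's/.* tclass=([^ ]+).*/\1/p')"

    # Give up if any field is missing, rather than print a partial rule.
    [ -n "$perms" ] && [ -n "$stype" ] && [ -n "$ttype" ] && [ -n "$tclass" ] || return 1

    # Several permissions need braces in policy syntax.
    case "$perms" in *" "*) perms="{ $perms }" ;; esac

    printf 'allow %s %s:%s %s;\n' "$stype" "$ttype" "$tclass" "$perms"
}
```

Fed the entry above, this prints `allow container_device_plugin_t container_runtime_t:unix_stream_socket connectto;`, i.e. the plugin's domain is being refused a connection to a socket owned by a process labelled container_runtime_t.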

I'll need to study whether this is the same as, or similar to, the issue linked above.

EDIT: using setenforce 0 is a workaround, though not a viable one if SELinux is required.

@ryanm101
Author

setenforce 0 fixes it, but NUC1 and NUC3 are both enforcing and working fine.
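
Since all three nodes are enforcing but only one fails, it may help to compare the SELinux-related state node by node. The sketch below prints a one-line summary for the local node. The function name is hypothetical, and `k3s-selinux` is an assumed name for the policy package the k3s installer adds; only container-selinux is named in this thread.

```shell
# One-line summary of the local node's SELinux mode and policy packages.
# Package names are assumptions (see the note above); adjust as needed.
selinux_node_report() {
    local mode pkgs
    mode="$(getenforce 2>/dev/null || echo unknown)"
    # rpm -q prints one line per package, including "not installed" notices.
    pkgs="$(rpm -q container-selinux k3s-selinux 2>/dev/null | tr '\n' ' ')"
    printf '%s mode=%s packages=%s\n' "$(uname -n)" "$mode" "${pkgs% }"
}
```

Running `selinux_node_report` on nuc1, nuc2 and nuc3 and comparing the three lines would show whether the FC39 node carries a different policy package version from the FC38 nodes.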

@tkatila
Contributor

tkatila commented Dec 18, 2023

I followed instructions from the audit entry:

sudo ausearch -c 'intel_gpu_devic' --raw | audit2allow -M intelgpudevice
sudo semodule -X 300 -i intelgpudevice.pp

That seems to allow the device plugin to access the kubelet. I'm not sure where we should file a bug: FC, k3s, or somewhere else.
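
Because `ausearch -c` collects every denial logged for that command name, the generated module can contain more than the one rule discussed here. As a hedged sketch, the helper below checks the `intelgpudevice.te` file that `audit2allow -M` writes next to the `.pp` before the module is installed. The function name `te_is_scoped_to` is hypothetical.

```shell
# Succeed only if every allow rule in the .te file has DOMAIN as its source
# type. Usage: te_is_scoped_to FILE DOMAIN   (hypothetical helper)
te_is_scoped_to() {
    ! grep -E '^[[:space:]]*allow[[:space:]]' "$1" \
        | grep -qvE "^[[:space:]]*allow[[:space:]]+$2[[:space:]]"
}

# Example, run in the directory where audit2allow -M wrote its output:
#   te_is_scoped_to intelgpudevice.te container_device_plugin_t \
#       && sudo semodule -X 300 -i intelgpudevice.pp
```

After installing, `sudo semodule -l | grep intelgpudevice` should list the module, and `sudo semodule -X 300 -r intelgpudevice` removes it again if a proper policy fix lands upstream.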

@mregmi
Member

mregmi commented Dec 21, 2023

The plugins already run with the proper label to have access to the kubelet. That policy went into the container-selinux package. Is that package installed on your node?
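
One way to check whether the policy loaded on a node already grants this access is to query it with `sesearch`, which on Fedora is assumed to come from the setools-console package. The sketch below uses the source type, target type, class and permission from the audit entry earlier in this thread. The function name is hypothetical.

```shell
# Succeed if the loaded policy has an allow rule for the access that was
# denied in the audit entry (types and class copied from that entry).
policy_allows_connectto() {
    local rules
    rules="$(sesearch -A -s container_device_plugin_t -t container_runtime_t \
        -c unix_stream_socket -p connectto 2>/dev/null)"
    # sesearch prints matching rules; empty output means no rule was found.
    [ -n "$rules" ]
}
```

If this succeeds on nuc1 and nuc3 but fails on nuc2, the difference would point at the policy packages rather than at the plugin's label.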

@ryanm101
Author

Those get installed alongside k3s, and they are installed.

@ryanm101
Author

> I followed instructions from the audit entry:
>
> sudo ausearch -c 'intel_gpu_devic' --raw | audit2allow -M intelgpudevice
> sudo semodule -X 300 -i intelgpudevice.pp
>
> That seems to allow the device plugin to access the kubelet. I'm not sure where we should file a bug: FC, k3s, or somewhere else.

Yes this seems to solve it.

@tkatila
Contributor

tkatila commented Dec 29, 2023

@mregmi do you happen to know the container-selinux version?
