
cri-o support #10313

Open
hbhasker opened this issue Apr 22, 2024 · 21 comments

Labels
type: enhancement New feature or request

Comments

@hbhasker
Contributor

Description

I have been tinkering to get runsc to work with CRI-O, and I think I have it mostly working, except that initContainers don't seem to be supported properly.

Reading the multi-container support code, it seems gVisor expects the first container to be the root container, but this assumption breaks down when the first container is an init container that runs to completion.

I am not sure what the right way to handle this is, as the CRI spec doesn't seem to indicate whether a container is an init container.

Opening it here for discussion.

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

No response

@hbhasker hbhasker added the type: enhancement New feature or request label Apr 22, 2024
@ayushr2
Collaborator

ayushr2 commented Apr 22, 2024

IIUC k8s also has initContainers, which gVisor supports. From my understanding, the root container is still the pause container and init containers are run as "regular subcontainers". The init containers are not the root/first containers. You could try a pod spec with initContainers in GKE Sandbox.

Could you share an initContainers reproducer that fails?

@hbhasker
Contributor Author

I don't have a simple reproducer, since our infra is a bit complicated, with kube manifests being generated, etc. But I never see a request to create a pause container being sent to runsc at all. The first request it gets is for the first init container.

That fails with the error mentioned in #3283 (comment).

From what I could understand, this happens because of this check:

if isRoot(args.Spec) {

For the init container this check fails, so it falls through to creating a subcontainer, which fails because the state file was never created: the sandbox was never set up, since this is the first container but its ID doesn't match the sandbox ID.
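
To make the failure mode concrete, here is a minimal, self-contained sketch of the root-versus-subcontainer branch described above. The type names, annotation key, and helper functions are illustrative placeholders, not the actual runsc code.

package main

import "fmt"

// Illustrative placeholder types; not the actual runsc types or annotation keys.
type Spec struct {
	Annotations map[string]string
}

type Args struct {
	ID        string
	SandboxID string
	Spec      *Spec
}

// isRoot approximates the referenced check: the root container is the one the
// CRI marks as the sandbox (pause) container via a container-type annotation.
func isRoot(s *Spec) bool {
	return s.Annotations["container-type"] == "sandbox" // placeholder key
}

func createContainer(a Args) error {
	if isRoot(a.Spec) {
		// Root container: create the sandbox itself and write its state file.
		return createSandbox(a)
	}
	// Subcontainer: join the sandbox created earlier by loading its state file.
	// If no root container ever ran (e.g. the very first request is an init
	// container whose ID differs from the sandbox ID), this load fails.
	if err := loadSandboxState(a.SandboxID); err != nil {
		return fmt.Errorf("sandbox %q not found: %w", a.SandboxID, err)
	}
	return startSubcontainer(a)
}

// Stubs standing in for the real sandbox machinery.
func createSandbox(Args) error      { return nil }
func loadSandboxState(string) error { return fmt.Errorf("state file missing") }
func startSubcontainer(Args) error  { return nil }

func main() {
	initContainer := Args{
		ID:        "init-0",
		SandboxID: "pod-sandbox",
		Spec:      &Spec{Annotations: map[string]string{"container-type": "container"}},
	}
	fmt.Println(createContainer(initContainer))
}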

In fact I don't even see any pause containers being run on the node at all. This is on kube 1.26 btw.

@hbhasker
Contributor Author

So I did some digging, and it looks like CRI-O as of 1.17 drops the pause container (more details here: cri-o/cri-o#91) and only injects it if it's explicitly enabled, a VM-like (kernel-separated) runtime such as Kata is used, or shared PID namespaces are enabled for the pod.

@hbhasker hbhasker changed the title InitContainers support cri-o support Apr 22, 2024
@hbhasker
Contributor Author

@fvoznika fyi for your input.

@fvoznika
Member

We treat the root container specially, because that's the one that holds the sandbox alive. Until we move to the new sandbox API in containerd (not planned), this is unlikely to change. Can you make cri-o treat gVisor like a VM too and inject the pause container?

@hbhasker
Contributor Author

I can try that. I might need to upstream a change to allow enabling it only for gVisor workloads; I don't want a pause container in other workloads.

@hbhasker
Contributor Author

Alternatively, can we add something in gVisor akin to the pause container to keep the namespaces alive? Or does it need to come from CRI-O?

@hbhasker
Contributor Author

So I enabled pause containers on one node by adding "drop_infra_ctr = false" to the crio runtime section in /etc/crio/crio.conf, which has stopped all the crashing. I can see that runsc comes up along with a gofer, and when I inspect it, it is running the pause container. I can even see the init containers being spawned, run, and exiting with a zero exit code, but for some reason kubelet is not proceeding beyond that: it still thinks the init container is failing and keeps restarting the same init container.

Going to enable strace to see what's going on, though the process seems to be exiting with a zero exit code.

@hbhasker
Contributor Author

hbhasker commented Apr 23, 2024

Some more progress. I have narrowed it down to the fact that the state command runs correctly and returns the container state, but CRI-O is reporting the container state to kubelet with an exit code of -1. I think conmon is somehow not returning the right exit code, or doesn't know it, so CRI-O sticks in -1.

e.g. the response that CRI-O is sending to kubelet:

Apr 23 16:31:32 i-079b567f773edebaf crio[70942]: time="2024-04-23 16:31:32.299833995Z" level=debug msg="Response: &ContainerStatusResponse{Status:&ContainerStatus{Id:7cf41ec61800919cffd828b2e8a9f9dc9b85b2053541a7abf47920cdc7b43d96,Metadata:&ContainerMetadata{Name:xxxxxx, Attempt:155,},State:CONTAINER_EXITED,CreatedAt:1713889590522640225,StartedAt:1713889591519766248,FinishedAt:1713889590693369610,ExitCode:-1

and the corresponding kubelet log that is causing it to restart the init container:

InitContainerStatuses:[{Name:xxxx State:{Waiting:&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=... pod=...(382f29e6-8e74-425c-9392-c641d713e74a),} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:-1,Signal:0,Reason:Error,Message:,StartedAt:2024-04-23 16:36:36 +0000 UTC,FinishedAt:2024-04-23 16:36:35 +0000 UTC,ContainerID:cri-o://df50944c0754a4538084d9e8dcb4e77c55679cf2c8f86dbef3ebd49a1ba331ac,}} Ready:false RestartCount:157

I even reduced it by removing all sidecars and init containers, and kubelet still thinks the main container is dead and keeps trying to restart it, even though runsc is running and the container is up and running.

Somehow the status checking in CRI-O is broken for gVisor: it reports an incorrect exit code (-1) to kubelet, which is the reason for this whole messed-up behaviour.

I suspect it has to do with conmon, which CRI-O uses for container monitoring. Somewhere something is messed up and it's setting the exit code to -1 in some odd case, resulting in this behaviour.

@hbhasker
Contributor Author

I think the issue might be due to conmon (https://github.com/containers/conmon/blob/main/src/conmon.c), which CRI-O uses to monitor containers; it's probably failing because it's not seeing the container processes.
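
For illustration, the monitoring model described above boils down to: spawn the container process as a direct child and take its exit code from wait(). A minimal sketch of that direct-wait pattern (written in Go rather than conmon's C, and purely illustrative):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// A monitor that assumes the workload is its direct child: launch it and
	// take the exit code from Wait(). With gVisor, the workload actually runs
	// inside the sentry, so the status observed this way does not correspond
	// to the workload's real exit status.
	cmd := exec.Command("/bin/true") // stand-in for "the container process"
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	if err := cmd.Wait(); err != nil {
		fmt.Println("wait failed:", err)
		return
	}
	fmt.Println("exit code seen by the monitor:", cmd.ProcessState.ExitCode())
}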

@haircommander

haircommander commented Apr 24, 2024

cri-o by default "drops" (chooses not to create in most cases) the pause container. You can use the drop_infra_ctr = false option in the crio configuration to make cri-o create the pause container 🙂

@hbhasker
Contributor Author

I did try that out, but it still doesn't fix the issue. runsc does come up fine and spins up the container, etc., but since the status returned to kubelet is still that of a failed container, it keeps trying to run the container over and over again.

@hbhasker
Contributor Author

Documenting what I have found out so far after talking with @haircommander (thanks, btw!). The issue is conmon: it relies on the container process being spawned directly by the runtime handler, and then it waits on that child to exit. This doesn't work with gVisor, as the container process is spawned inside the user-space kernel.

Reading the runsc containerd shim code, it seems to rely on a non-OCI-spec command, "runsc wait", to get container status. So it's likely we need something similar, but as it happens CRI-O does support using a containerd shim for runtime monitoring of Kata containers if the runtime handler type is set to "vm" (https://github.com/cri-o/cri-o/blob/main/internal/oci/runtime_vm.go).

Reading that code, outside of specifying mount options for Kata, it seems it might work as long as we can get it to use the runsc shim instead of the Kata shim. Going to try it and see if I can get that to work.

@fvoznika this seems like something that could work and would probably require minimal effort to make CRI-O work with gVisor.
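
For reference, here is a rough sketch of the approach mentioned above: ask the sandbox for the container's exit status via "runsc wait" rather than waiting on a host child process. The JSON field names are an assumption based on the shim code; the root path and container ID are illustrative.

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// waitResult mirrors the JSON that "runsc wait" prints; the exact field names
// here are an assumption based on the shim code, not a documented API.
type waitResult struct {
	ID         string `json:"id"`
	ExitStatus int    `json:"exitStatus"`
}

// waitContainer blocks until the container exits and returns its exit status,
// by asking the sandbox instead of waiting on a host child process.
func waitContainer(root, id string) (int, error) {
	out, err := exec.Command("runsc", "--root", root, "wait", id).Output()
	if err != nil {
		return 0, fmt.Errorf("runsc wait %s: %w", id, err)
	}
	var res waitResult
	if err := json.Unmarshal(out, &res); err != nil {
		return 0, fmt.Errorf("parsing runsc wait output: %w", err)
	}
	return res.ExitStatus, nil
}

func main() {
	status, err := waitContainer("/run/runc", "7cf41ec61800")
	if err != nil {
		fmt.Println("wait failed:", err)
		return
	}
	fmt.Println("container exited with status", status)
}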

@hbhasker
Contributor Author

@fvoznika something else I just realized: gVisor seems to be using a really old containerd shim API (v1.4.13), which has been deprecated since 2022.

https://github.com/containerd/containerd/blob/main/RELEASES.md#kubernetes-support

In fact, reading the above table, I am surprised it even works with newer k8s. Is there a plan to upgrade the shim API gVisor uses?

I was very confused initially, since the code gVisor references doesn't even exist on the main branch of the containerd repo anywhere.

@hbhasker
Contributor Author

Another stumbling block: after changing the crio config to use containerd-shim-runsc-v1, it would just not recognize the handler.

It took me a lot of time to figure out that it rejects the config because the gVisor shim file name ends in v1 (even though it uses the containerd shim v2 API), and CRI-O 1.26 has an actual check that the file name matches a regex ending in v2.

see:
https://github.com/google/gvisor-containerd-shim?tab=readme-ov-file#configuration

&

https://github.com/cri-o/cri-o/blob/b884aca4deeb79d582e247610a67b580945a2090/pkg/config/config.go#L58 which is used here

https://github.com/cri-o/cri-o/blob/b884aca4deeb79d582e247610a67b580945a2090/pkg/config/config.go#L1385
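
To illustrate the kind of check described above (this shows the shape of the validation, not CRI-O's exact regex): a shim binary named containerd-shim-runsc-v1 fails a pattern that requires the base name to end in -v2.

package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// Illustrative pattern only; CRI-O's actual regex may differ.
var shimNameV2 = regexp.MustCompile(`^containerd-shim-[a-z0-9-]+-v2$`)

func validRuntimePath(path string) bool {
	return shimNameV2.MatchString(filepath.Base(path))
}

func main() {
	fmt.Println(validRuntimePath("/usr/local/bin/containerd-shim-runsc-v1")) // false
	fmt.Println(validRuntimePath("/usr/local/bin/containerd-shim-runsc-v2")) // true
}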

@hbhasker
Contributor Author

hbhasker commented Apr 25, 2024

Got it to work with a few changes in the shim. CRI-O never creates the log file that the shim expects, resulting in runsc failing to start.

  1. CRI-O never sets the type_url in the code that starts the shim, resulting in runsc failing to get the options properly. I just hacked it up to comment out that code for now.

  2. Here's how I have it set up in the configs:

cat /etc/crio/crio.conf.d/99-gvisor.conf
[crio.runtime.runtimes.runsc]
runtime_path = "/usr/local/bin/containerd-shim-runsc-v2"
runtime_config_path = "/etc/containerd/runsc.toml"
runtime_type = "vm"
runtime_root = "/run/runc"
cat /etc/containerd/runsc.toml
binary_name = "/usr/local/bin/runsc"
root = "/run/runc"
[runsc_config]
  systemd-cgroup = "true"
  debug = "true"
  debug-log = "/var/log/runsc/"
  root = "/run/runc"

Finally, I am reasonably sure the containerd/config.toml is not required (though I did create it), since runsc never really runs a containerd process.

git diff
diff --git a/pkg/shim/proc/init.go b/pkg/shim/proc/init.go
index fcdc8e2bc..838539bcd 100644
--- a/pkg/shim/proc/init.go
+++ b/pkg/shim/proc/init.go
@@ -86,11 +86,11 @@ func NewRunsc(root, path, namespace, runtime string, config map[string]string, s
        return &runsc.Runsc{
                Command:      runtime,
                PdeathSignal: unix.SIGKILL,
-               Log:          filepath.Join(path, "log.json"),
-               LogFormat:    runc.JSON,
-               PanicLog:     utils.PanicLogPath(spec),
-               Root:         filepath.Join(root, namespace),
-               Config:       config,
+               //Log:          filepath.Join(path, "log.json"),
+               //LogFormat:    runc.JSON,
+               PanicLog: utils.PanicLogPath(spec),
+               Root:     filepath.Join(root, namespace),
+               Config:   config,
        }
 }

diff --git a/pkg/shim/service.go b/pkg/shim/service.go
index a6904e1ae..baffed3f3 100644
--- a/pkg/shim/service.go
+++ b/pkg/shim/service.go
@@ -50,7 +50,7 @@ import (
        "github.com/sirupsen/logrus"
        "golang.org/x/sys/unix"
        "gvisor.dev/gvisor/pkg/cleanup"
-       "gvisor.dev/gvisor/pkg/shim/runtimeoptions/v14"
+       v14 "gvisor.dev/gvisor/pkg/shim/runtimeoptions/v14"

        "gvisor.dev/gvisor/pkg/shim/proc"
        "gvisor.dev/gvisor/pkg/shim/runsc"
@@ -372,20 +372,26 @@ func (s *service) create(ctx context.Context, r *taskAPI.CreateTaskRequest) (*ta
                        if o.ConfigPath == "" {
                                break
                        }
-                       if o.TypeUrl != optionsType {
-                               return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
-                       }
+                       //if o.TypeUrl != optionsType {
+                       //      return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
+                       //}
                        path = o.ConfigPath
                case *v14.Options: // containerd 1.4-
                        if o.ConfigPath == "" {
                                break
                        }
-                       if o.TypeUrl != optionsType {
-                               return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
-                       }
+                       // if o.TypeUrl != optionsType {
+                       //      return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
+                       // }
                        path = o.ConfigPath
                default:
-                       return nil, fmt.Errorf("unsupported option type %q", r.Options.TypeUrl)
+                       log.L.Errorf("unsupported option type %q, assuming its containerd 1.5+", r.Options.TypeUrl)
+                       var opts runtimeoptions.Options
+                       if err := typeurl.UnmarshalTo(r.Options, &opts); err != nil {
+                               return nil, fmt.Errorf("failed to unmarshal options %q:", r.Options)
+                       }
+                       path = opts.ConfigPath
+                       //return nil, fmt.Errorf("unsupported option type %q", r.Options.TypeUrl)
                }
                if path != "" {
                        if _, err = toml.DecodeFile(path, &s.opts); err != nil {

@hbhasker
Contributor Author

That said, I think it's not all working properly; the kubelet readiness probe seems to be failing:

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  2m26s (x5879 over 16h)  kubelet  Readiness probe errored: rpc error: code = Unknown desc = ExecSyncContainer failed: ttrpc: closed: unknown

Also, a container that clearly crashed in the pod (the pod had 5 and one crashed) shows as running. So not all container monitoring is working as expected.

@hbhasker
Contributor Author

Looks like with systrap, something gets messed up around detecting container exits. When I switched to ptrace, at least the API server clearly shows that only 4/5 containers are running.

The readiness probe still fails, which I believe ends up here: https://github.com/cri-o/cri-o/blob/0650d0e00fe9637f2bb8e4cac1b678345289ce61/internal/oci/runtime_vm.go#L376, but I haven't looked into it enough to see what it's doing.

@hbhasker
Contributor Author

@avagin fyi ^^^

@EtiennePerot
Contributor

EtiennePerot commented Apr 25, 2024

Looks like with systrap, something gets messed up around detecting container exits. When I switched to ptrace, at least the API server clearly shows that only 4/5 containers are running.

It would be good to get runsc logs to see if there's a platform-dependent issue. If there is, that's a separate bug from this one. Note that Systrap recently changed to require SECCOMP_RET_USER_NOTIF, which is a Linux >= 5.0 feature.

@hbhasker
Contributor Author

BTW, even with ptrace, from what I can see crictl still thinks the container is alive, even though the logs show the process inside the container is definitely dead. I will run it again with systrap and see what the difference is between systrap and ptrace.
