
cri-o support #10313

Open
hbhasker opened this issue Apr 22, 2024 · 21 comments

Labels
type: enhancement New feature or request

Comments

@hbhasker
Contributor

Description

I have been tinkering to get runsc to work with CRI-O, and I think I have it mostly working, except that initContainers don't seem to be supported properly.

Reading the multi-container support code, it seems gVisor expects the first container to be the root container, but this assumption breaks down when the first container is an init container that runs to completion.

I am not sure what the right way to handle this is, as the CRI spec doesn't seem to indicate whether a container is an init container.

Opening it here for discussion.

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

No response

@hbhasker hbhasker added the type: enhancement New feature or request label Apr 22, 2024
@ayushr2
Collaborator

ayushr2 commented Apr 22, 2024

IIUC k8s also has initContainers, which gVisor supports. From my understanding, the root container is still the pause container and init containers are run as "regular subcontainers". The init containers are not the root/first containers. You could try a pod spec with initContainers in GKE Sandbox.

Could you share an initContainers reproducer that fails?

@hbhasker
Contributor Author

I don't have a simple reproducer, since our infra is a bit complicated, with kube manifests being generated, etc. But I never see a request to create a pause container being sent to runsc at all. The first request it gets is for the first init container.

That fails with the error mentioned in #3283 (comment).

From what I could understand, this happens because of this check:

if isRoot(args.Spec) {

For the init container this check fails, so it falls through to creating a subcontainer, which fails because the state file was never created: the sandbox was never set up, since this is the first container but its ID doesn't match the sandbox ID.
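
To make the failure mode concrete, here is a minimal, self-contained sketch of the root-versus-subcontainer branch described above. The type names, annotation key, and helper functions are illustrative placeholders, not the actual runsc code.

package main

import "fmt"

// Illustrative placeholder types; not the actual runsc types or annotation keys.
type Spec struct {
	Annotations map[string]string
}

type Args struct {
	ID        string
	SandboxID string
	Spec      *Spec
}

// isRoot approximates the referenced check: the root container is the one the
// CRI marks as the sandbox (pause) container via a container-type annotation.
func isRoot(s *Spec) bool {
	return s.Annotations["container-type"] == "sandbox" // placeholder key
}

func createContainer(a Args) error {
	if isRoot(a.Spec) {
		// Root container: create the sandbox itself and write its state file.
		return createSandbox(a)
	}
	// Subcontainer: join the sandbox created earlier by loading its state file.
	// If no root container ever ran (e.g. the very first request is an init
	// container whose ID differs from the sandbox ID), this load fails.
	if err := loadSandboxState(a.SandboxID); err != nil {
		return fmt.Errorf("sandbox %q not found: %w", a.SandboxID, err)
	}
	return startSubcontainer(a)
}

// Stubs standing in for the real sandbox machinery.
func createSandbox(Args) error      { return nil }
func loadSandboxState(string) error { return fmt.Errorf("state file missing") }
func startSubcontainer(Args) error  { return nil }

func main() {
	initContainer := Args{
		ID:        "init-0",
		SandboxID: "pod-sandbox",
		Spec:      &Spec{Annotations: map[string]string{"container-type": "container"}},
	}
	fmt.Println(createContainer(initContainer))
}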

In fact I don't even see any pause containers being run on the node at all. This is on kube 1.26 btw.

@hbhasker
Contributor Author

So I did some digging, and it looks like CRI-O as of 1.17 drops the pause container (more details here: cri-o/cri-o#91) and only injects it if it's explicitly enabled, a VM-like (kernel-separated) runtime such as Kata is used, or shared PID namespaces are enabled for the pod.

@hbhasker hbhasker changed the title InitContainers support cri-o support Apr 22, 2024
@hbhasker
Contributor Author

@fvoznika fyi for your input.

@fvoznika
Member

We treat the root container specially, because that's the one that holds the sandbox alive. Until we move to the new sandbox API in containerd (not planned), this is unlikely to change. Can you make cri-o treat gVisor like a VM too and inject the pause container?

@hbhasker
Contributor Author

I can try that. I might need to upstream a change to allow enabling it only for gVisor workloads; I don't want a pause container in other workloads.

@hbhasker
Contributor Author

Alternatively, can we add something in gVisor akin to the pause container to keep the namespaces alive? Or does it need to come from CRI-O?

@hbhasker
Contributor Author

So I enabled pause containers on one node by adding "drop_infra_ctr = false" to the crio runtime section in /etc/crio/crio.conf, which has stopped all the crashing. I can see that runsc comes up along with a gofer, and when I inspect it, it is running the pause container. I can even see the init containers being spawned, run, and exiting with a zero exit code, but for some reason kubelet is not proceeding beyond that: it still thinks the init container is failing and keeps restarting the same init container.

Going to enable strace to see what's going on, though the process seems to be exiting with a zero exit code.

@hbhasker
Contributor Author

hbhasker commented Apr 23, 2024

Some more progress. I have narrowed it down to the fact that the state command runs correctly and returns the container state, but CRI-O is reporting the container state to kubelet with an exit code of -1. I think conmon is somehow not returning the right exit code, or doesn't know it, so CRI-O sticks in -1.

e.g. the response that CRI-O is sending to kubelet:

Apr 23 16:31:32 i-079b567f773edebaf crio[70942]: time="2024-04-23 16:31:32.299833995Z" level=debug msg="Response: &ContainerStatusResponse{Status:&ContainerStatus{Id:7cf41ec61800919cffd828b2e8a9f9dc9b85b2053541a7abf47920cdc7b43d96,Metadata:&ContainerMetadata{Name:xxxxxx, Attempt:155,},State:CONTAINER_EXITED,CreatedAt:1713889590522640225,StartedAt:1713889591519766248,FinishedAt:1713889590693369610,ExitCode:-1

and the corresponding kubelet log that is causing it to restart the init container:

InitContainerStatuses:[{Name:xxxx State:{Waiting:&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=... pod=...(382f29e6-8e74-425c-9392-c641d713e74a),} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:-1,Signal:0,Reason:Error,Message:,StartedAt:2024-04-23 16:36:36 +0000 UTC,FinishedAt:2024-04-23 16:36:35 +0000 UTC,ContainerID:cri-o://df50944c0754a4538084d9e8dcb4e77c55679cf2c8f86dbef3ebd49a1ba331ac,}} Ready:false RestartCount:157

I even reduced it by removing all sidecars and init containers, and kubelet still thinks the main container is dead and keeps trying to restart it, even though runsc is running and the container is up and running.

Somehow the status checking in CRI-O is broken for gVisor: it reports an incorrect exit code (-1) to kubelet, which is the reason for this whole messed-up behaviour.

I suspect it has to do with conmon, which CRI-O uses for container monitoring. Somewhere something is messed up and it's setting the exit code to -1 in some odd case, resulting in this behaviour.

@hbhasker
Contributor Author

I think the issue might be due to conmon (https://github.com/containers/conmon/blob/main/src/conmon.c), which CRI-O uses to monitor containers; it's probably failing because it's not seeing the container processes.
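
For illustration, the monitoring model described above boils down to: spawn the container process as a direct child and take its exit code from wait(). A minimal sketch of that direct-wait pattern (written in Go rather than conmon's C, and purely illustrative):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// A monitor that assumes the workload is its direct child: launch it and
	// take the exit code from Wait(). With gVisor, the workload actually runs
	// inside the sentry, so the status observed this way does not correspond
	// to the workload's real exit status.
	cmd := exec.Command("/bin/true") // stand-in for "the container process"
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	if err := cmd.Wait(); err != nil {
		fmt.Println("wait failed:", err)
		return
	}
	fmt.Println("exit code seen by the monitor:", cmd.ProcessState.ExitCode())
}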

@haircommander

haircommander commented Apr 24, 2024

cri-o by default "drops" (chooses not to create in most cases) the pause container. You can use the drop_infra_ctr = false option in the crio configuration to make cri-o create the pause container 🙂

@hbhasker
Contributor Author

I did try that out, but it still doesn't fix the issue. runsc does come up fine and spins up the container, etc., but since the status returned to kubelet is still that of a failed container, it keeps trying to run the container over and over again.

@hbhasker
Contributor Author

Documenting what I have found out so far after talking with @haircommander (thanks, btw!). The issue is conmon: it relies on the container process being spawned directly by the runtime handler, and then it waits on that child to exit. This doesn't work with gVisor, as the container process is spawned inside the user-space kernel.

Reading the runsc containerd shim code, it seems to rely on a non-OCI-spec command, "runsc wait", to get container status. So it's likely we need something similar, but as it happens CRI-O does support using a containerd shim for runtime monitoring of Kata containers if the runtime handler type is set to "vm" (https://github.com/cri-o/cri-o/blob/main/internal/oci/runtime_vm.go).

Reading that code, outside of specifying mount options for Kata, it seems it might work as long as we can get it to use the runsc shim instead of the Kata shim. Going to try it and see if I can get that to work.

@fvoznika this seems like something that could work and would probably require minimal effort to make CRI-O work with gVisor.
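
For reference, here is a rough sketch of the approach mentioned above: ask the sandbox for the container's exit status via "runsc wait" rather than waiting on a host child process. The JSON field names are an assumption based on the shim code; the root path and container ID are illustrative.

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// waitResult mirrors the JSON that "runsc wait" prints; the exact field names
// here are an assumption based on the shim code, not a documented API.
type waitResult struct {
	ID         string `json:"id"`
	ExitStatus int    `json:"exitStatus"`
}

// waitContainer blocks until the container exits and returns its exit status,
// by asking the sandbox instead of waiting on a host child process.
func waitContainer(root, id string) (int, error) {
	out, err := exec.Command("runsc", "--root", root, "wait", id).Output()
	if err != nil {
		return 0, fmt.Errorf("runsc wait %s: %w", id, err)
	}
	var res waitResult
	if err := json.Unmarshal(out, &res); err != nil {
		return 0, fmt.Errorf("parsing runsc wait output: %w", err)
	}
	return res.ExitStatus, nil
}

func main() {
	status, err := waitContainer("/run/runc", "7cf41ec61800")
	if err != nil {
		fmt.Println("wait failed:", err)
		return
	}
	fmt.Println("container exited with status", status)
}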

@hbhasker
Contributor Author

@fvoznika something else I just realized: gVisor seems to be using a really old containerd shim API (v1.4.13), which has been deprecated since 2022.

https://github.com/containerd/containerd/blob/main/RELEASES.md#kubernetes-support

In fact, reading the above table, I am surprised it even works with newer k8s. Is there a plan to upgrade the shim API gVisor uses?

I was very confused initially, since the code gVisor references doesn't even exist on the main branch of the containerd repo anywhere.

@hbhasker
Contributor Author

Another stumbling block: after changing the crio config to use containerd-shim-runsc-v1, it would just not recognize the handler.

It took me a lot of time to figure out that it rejects the config because the gVisor shim file name ends in v1 (even though it uses the containerd shim v2 API), and CRI-O 1.26 has an actual check that the file name matches a regex ending in v2.

see:
https://github.com/google/gvisor-containerd-shim?tab=readme-ov-file#configuration

&

https://github.com/cri-o/cri-o/blob/b884aca4deeb79d582e247610a67b580945a2090/pkg/config/config.go#L58 which is used here

https://github.com/cri-o/cri-o/blob/b884aca4deeb79d582e247610a67b580945a2090/pkg/config/config.go#L1385
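
To illustrate the kind of check described above (this shows the shape of the validation, not CRI-O's exact regex): a shim binary named containerd-shim-runsc-v1 fails a pattern that requires the base name to end in -v2.

package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// Illustrative pattern only; CRI-O's actual regex may differ.
var shimNameV2 = regexp.MustCompile(`^containerd-shim-[a-z0-9-]+-v2$`)

func validRuntimePath(path string) bool {
	return shimNameV2.MatchString(filepath.Base(path))
}

func main() {
	fmt.Println(validRuntimePath("/usr/local/bin/containerd-shim-runsc-v1")) // false
	fmt.Println(validRuntimePath("/usr/local/bin/containerd-shim-runsc-v2")) // true
}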

@hbhasker
Contributor Author

hbhasker commented Apr 25, 2024

Got it to work with a few changes in the shim. CRI-O never creates the log file that the shim expects, resulting in runsc failing to start.

  1. CRI-O never sets the type_url in the code that starts the shim, resulting in runsc failing to get the options properly. I just hacked it up to comment out that code for now.

  2. Here's how I have it set up in the configs:

cat /etc/crio/crio.conf.d/99-gvisor.conf
[crio.runtime.runtimes.runsc]
runtime_path = "/usr/local/bin/containerd-shim-runsc-v2"
runtime_config_path = "/etc/containerd/runsc.toml"
runtime_type = "vm"
runtime_root = "/run/runc"
cat /etc/containerd/runsc.toml
binary_name = "/usr/local/bin/runsc"
root = "/run/runc"
[runsc_config]
  systemd-cgroup = "true"
  debug = "true"
  debug-log = "/var/log/runsc/"
  root = "/run/runc"

Finally, I am reasonably sure the containerd/config.toml is not required (though I did create it), since runsc never really runs a containerd process.

git diff
diff --git a/pkg/shim/proc/init.go b/pkg/shim/proc/init.go
index fcdc8e2bc..838539bcd 100644
--- a/pkg/shim/proc/init.go
+++ b/pkg/shim/proc/init.go
@@ -86,11 +86,11 @@ func NewRunsc(root, path, namespace, runtime string, config map[string]string, s
        return &runsc.Runsc{
                Command:      runtime,
                PdeathSignal: unix.SIGKILL,
-               Log:          filepath.Join(path, "log.json"),
-               LogFormat:    runc.JSON,
-               PanicLog:     utils.PanicLogPath(spec),
-               Root:         filepath.Join(root, namespace),
-               Config:       config,
+               //Log:          filepath.Join(path, "log.json"),
+               //LogFormat:    runc.JSON,
+               PanicLog: utils.PanicLogPath(spec),
+               Root:     filepath.Join(root, namespace),
+               Config:   config,
        }
 }

diff --git a/pkg/shim/service.go b/pkg/shim/service.go
index a6904e1ae..baffed3f3 100644
--- a/pkg/shim/service.go
+++ b/pkg/shim/service.go
@@ -50,7 +50,7 @@ import (
        "github.com/sirupsen/logrus"
        "golang.org/x/sys/unix"
        "gvisor.dev/gvisor/pkg/cleanup"
-       "gvisor.dev/gvisor/pkg/shim/runtimeoptions/v14"
+       v14 "gvisor.dev/gvisor/pkg/shim/runtimeoptions/v14"

        "gvisor.dev/gvisor/pkg/shim/proc"
        "gvisor.dev/gvisor/pkg/shim/runsc"
@@ -372,20 +372,26 @@ func (s *service) create(ctx context.Context, r *taskAPI.CreateTaskRequest) (*ta
                        if o.ConfigPath == "" {
                                break
                        }
-                       if o.TypeUrl != optionsType {
-                               return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
-                       }
+                       //if o.TypeUrl != optionsType {
+                       //      return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
+                       //}
                        path = o.ConfigPath
                case *v14.Options: // containerd 1.4-
                        if o.ConfigPath == "" {
                                break
                        }
-                       if o.TypeUrl != optionsType {
-                               return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
-                       }
+                       // if o.TypeUrl != optionsType {
+                       //      return nil, fmt.Errorf("unsupported option type %q", o.TypeUrl)
+                       // }
                        path = o.ConfigPath
                default:
-                       return nil, fmt.Errorf("unsupported option type %q", r.Options.TypeUrl)
+                       log.L.Errorf("unsupported option type %q, assuming its containerd 1.5+", r.Options.TypeUrl)
+                       var opts runtimeoptions.Options
+                       if err := typeurl.UnmarshalTo(r.Options, &opts); err != nil {
+                               return nil, fmt.Errorf("failed to unmarshal options %q:", r.Options)
+                       }
+                       path = opts.ConfigPath
+                       //return nil, fmt.Errorf("unsupported option type %q", r.Options.TypeUrl)
                }
                if path != "" {
                        if _, err = toml.DecodeFile(path, &s.opts); err != nil {

@hbhasker
Contributor Author

That said, I think it's not all working properly; the kubelet readiness probe seems to be failing:

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  2m26s (x5879 over 16h)  kubelet  Readiness probe errored: rpc error: code = Unknown desc = ExecSyncContainer failed: ttrpc: closed: unknown

Also, a container that clearly crashed in the pod (the pod had 5 and one crashed) shows as running. So not all container monitoring is working as expected.

@hbhasker
Contributor Author

Looks like with systrap, something gets messed up around detecting container exits. When I switched to ptrace, at least the API server clearly shows that only 4/5 containers are running.

The readiness probe still fails, which I believe ends up here: https://github.com/cri-o/cri-o/blob/0650d0e00fe9637f2bb8e4cac1b678345289ce61/internal/oci/runtime_vm.go#L376, but I haven't looked into it enough to see what it's doing.

@hbhasker
Contributor Author

@avagin fyi ^^^

@EtiennePerot
Contributor

EtiennePerot commented Apr 25, 2024

Looks like with systrap, something gets messed up around detecting container exits. When I switched to ptrace, at least the API server clearly shows that only 4/5 containers are running.

It would be good to get runsc logs to see if there's a platform-dependent issue. If there is, that's a separate bug from this one. Note that Systrap recently changed to require SECCOMP_RET_USER_NOTIF, which is a Linux >= 5.0 feature.

@hbhasker
Contributor Author

BTW, even with ptrace, from what I can see crictl still thinks the container is alive, even though the logs show the process inside the container is definitely dead. I will run it again with systrap and see what the difference is between systrap and ptrace.
