SGX-enabled pods sometimes get created without SGX device mounted #1695

Open
Feelas opened this issue Mar 22, 2024 · 5 comments

Feelas commented Mar 22, 2024

Describe the issue
Not entirely sure whether this is a bug in IDP, an upstream K8s/kubelet issue, or whether it just needs some explanation of how to properly ensure that sgx.intel.com/epc is already available.

We are seeing a randomly occurring situation where Pods with "requests: sgx.intel.com/epc: 1Gi" defined in the Pod spec get created without the necessary /dev/sgx_enclave volume mount. This causes the workload to fail because the SGX device is missing (ENOENT error in Gramine).
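For illustration, a minimal sketch of a Pod spec with such a request (the names and image are placeholders, not our actual manifest):

    apiVersion: v1
    kind: Pod
    metadata:
      name: sgx-workload            # placeholder name
    spec:
      containers:
      - name: gramine-app           # placeholder container
        image: example.com/gramine-app:latest
        resources:
          requests:
            sgx.intel.com/epc: 1Gi
          limits:
            sgx.intel.com/epc: 1Gi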

  • When inspected with crictl, the mount points are missing from the containers.
  • We've seen this happening for Pods managed by Deployments, StatefulSets and Jobs.
  • Usually, removing the affected resource (Deploy/STS/Job) and resyncing it (equivalent to 'kubectl replace') makes the Pod start correctly.
  • Delaying a specific container (e.g. by using sleep in an init-container) does not change much, perhaps because this aspect is tied to Pod (and not container) creation time.

My (intuitive) understanding is that until "sgx.intel.com/epc" is registered on the node, the Pod should not get created (it should remain unschedulable due to insufficient resources)?
If that intuition is wrong, what is the correct way to make sure that Pod creation waits for "sgx.intel.com/epc" availability? Since this process runs as part of an automated flow (ArgoCD), we cannot currently be fully sure of the order in which the resources get created.
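For what it's worth, a node-level check that could be run before syncing the workloads, sketched with a generic jsonpath (the exact invocation is illustrative):

    # List each node's allocatable SGX EPC; dots in the resource name must be
    # escaped inside the jsonpath expression.
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.sgx\.intel\.com/epc}{"\n"}{end}'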

To Reproduce
No clear reproduction scenario; it happens randomly. We did not find a correlation with a specific OCP version.

Expected behavior
The Pod is created with the necessary /dev/sgx_enclave volume mount every time.

System:

  • OS version: Red Hat OpenShift 4.14, 4.15
  • Kernel version: RH kernel versions based on 5.14
  • Device plugins version: v0.28 from OpenShift OperatorHub
  • Hardware info: 4th and 5th Gen Intel Xeon Processors with QAT

Additional context
It might be an edge case similar to the one noticed for NVIDIA GPUs (this specific comment): NVIDIA/k8s-device-plugin#291 (comment).
As in the linked issue, the Pod somehow gets started before sgx.intel.com/epc can be successfully allocated and mounted. Very curious why that happens.

mythi (Contributor) commented Mar 22, 2024

@Feelas can you check the pod's resources when the error happens?

I believe you only get

sgx.intel.com/epc: 1Gi

and not

sgx.intel.com/epc: 1Gi
sgx.intel.com/enclave: 1

which then suggests the operator has failed to mutate the Pod. You could check the API server logs for more details on why the operator/webhook failed.
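A rough sketch of both checks (the pod and node names are placeholders; "failed calling webhook" is a typical API server admission error string, adjust the grep to what your logs actually contain):

    # Show the effective resources of the failing Pod; after a successful
    # mutation both sgx.intel.com/epc and sgx.intel.com/enclave should appear.
    kubectl get pod <pod-name> -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'

    # Grep the API server logs for admission webhook failures (on OpenShift the
    # kube-apiserver pods live in the openshift-kube-apiserver namespace and are
    # named per node).
    kubectl -n openshift-kube-apiserver logs kube-apiserver-<node-name> | grep -i 'failed calling webhook'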

tkatila (Contributor) commented Mar 22, 2024

Not sure if it's related, but we've seen something somewhat similar with the e2e FPGA tests. It seems a Pod gets scheduled with the generic resource name (the Pod goes to Pending), and when the operator comes fully online it doesn't detect the Pending Pod, so the Pod is left in the Pending state.

Feelas (Author) commented Mar 25, 2024

@mythi we've managed to trigger the error again.

As a result:

  • sgx.intel.com/enclave was indeed missing from the resources section of the Pod, and thus Gramine failed.
  • The Pod was stuck in CrashLoopBackOff for a very long time.
  • No "admission: Mutated SGX Pod" messages for this specific Pod were present in the controller-manager Pod's logs.
  • After DELETE-ing the Pod, the "admission: Mutated SGX Pod" logs for our workload appear instantly and it starts correctly.
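For reference, the log check was roughly the following (the namespace and deployment names are placeholders for whatever the operator install uses in your cluster):

    # Placeholder names; substitute the actual controller-manager deployment
    # from the Intel Device Plugins Operator install.
    kubectl -n <operator-namespace> logs deploy/<controller-manager-deployment> | grep "Mutated SGX Pod"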

Given my limited knowledge of webhooks:

  • I took a quick look at the webhook triggers and it looks like it hooks on the CREATE operation of the Pod (webhooks.rules in the SGX MutatingWebhookConfiguration), roughly as sketched below.
  • Will a restarting Pod match a "pods CREATE" rule for a webhook? Maybe that is the reason for this behaviour.

If yes, it looks like the webhook might not handle "retroactively" mutating Pods w/ SGX requests that were created on the cluster before the webhook itself is created and registered, unless of course I am missing something about how the webhooks work.
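Roughly what the trigger looks like to me, paraphrased with placeholder names (not the exact manifest shipped by the operator):

    webhooks:
    - name: sgx.mutator.webhooks.intel.com    # placeholder name
      rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]                # only Pod creation is matched
        resources: ["pods"]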

mythi (Contributor) commented Mar 25, 2024

> If yes, it looks like the webhook might not handle "retroactively" mutating Pods w/ SGX requests that were created on the cluster before the webhook itself is created and registered, unless of course I am missing something about how the webhooks work.

I'd like to understand why the mutation gets missed on CREATE because that's what starts CrashLoopBackOff.

Feelas (Author) commented Mar 25, 2024

> > If yes, it looks like the webhook might not handle "retroactively" mutating Pods w/ SGX requests that were created on the cluster before the webhook itself is created and registered, unless of course I am missing something about how the webhooks work.
>
> I'd like to understand why the mutation gets missed on CREATE because that's what starts CrashLoopBackOff.

Same here; the only idea that would make sense is if retries for spawning the Pod don't trigger a CREATE.

So the timeline would look like this, in my mind:
The Pod gets created (causing a CREATE) while the webhook is still initializing -> the Pod is stuck restarting and does not emit further CREATE events -> the webhook gets initialized, registered and responds, but no CREATE events are passed from the failing Pod.

And at this point it would boil down to the question: "when can we be sure that the CREATE hook is already up?"
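One way to approximate "the CREATE hook is up" in an automated flow would be to wait for the webhook's backing deployment before applying the SGX workloads; a rough sketch with placeholder names (the actual namespace and deployment depend on the operator install):

    # Placeholder names; substitute the namespace and deployment that back the
    # SGX mutating webhook in your operator install.
    kubectl -n <operator-namespace> wait --for=condition=Available \
      deployment/<webhook-deployment> --timeout=120s

    # Only after the wait succeeds, apply/sync the SGX workloads.
    kubectl apply -f sgx-workload.yaml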
