SGX-enabled pods sometimes get created without SGX device mounted #1695

Open
Feelas opened this issue Mar 22, 2024 · 5 comments

Feelas commented Mar 22, 2024

Describe the issue
Not entirely sure whether this is a bug in IDP, an upstream K8s/kubelet issue, or whether it just needs some explanation of how to properly ensure that sgx.intel.com/epc is already available.

We are seeing a randomly occurring situation where Pods with "requests: sgx.intel.com/epc: 1Gi" defined in the Pod spec get created without the necessary /dev/sgx_enclave volume mount. This causes the workload to fail because the SGX device is missing (ENOENT error in Gramine).
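For illustration, a minimal sketch of a Pod spec with such a request (the names and image are placeholders, not our actual manifest):

    apiVersion: v1
    kind: Pod
    metadata:
      name: sgx-workload            # placeholder name
    spec:
      containers:
      - name: gramine-app           # placeholder container
        image: example.com/gramine-app:latest
        resources:
          requests:
            sgx.intel.com/epc: 1Gi
          limits:
            sgx.intel.com/epc: 1Gi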

  • When inspected with crictl, the mount points are missing from the containers.
  • We've seen this happening for Pods managed by Deployments, StatefulSets and Jobs.
  • Usually, removing the affected resource (Deploy/STS/Job) and resyncing it (equivalent to 'kubectl replace') makes the Pod start correctly.
  • Delaying a specific container (e.g. by using sleep in an init-container) does not change much, perhaps because this aspect is tied to Pod (and not container) creation time.

My (intuitive) understanding is that until "sgx.intel.com/epc" is registered on the node, the Pod should not get created (it should remain unschedulable due to insufficient resources)?
If that intuition is wrong, what is the correct way to make sure that Pod creation waits for "sgx.intel.com/epc" availability? Since this process runs as part of an automated flow (ArgoCD), we cannot currently be fully sure of the order in which the resources get created.
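For what it's worth, a node-level check that could be run before syncing the workloads, sketched with a generic jsonpath (the exact invocation is illustrative):

    # List each node's allocatable SGX EPC; dots in the resource name must be
    # escaped inside the jsonpath expression.
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.sgx\.intel\.com/epc}{"\n"}{end}'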

To Reproduce
No clear reproduction scenario; it happens randomly. We did not find a correlation with a specific OCP version.

Expected behavior
The Pod is created with the necessary /dev/sgx_enclave volume mount every time.

System:

  • OS version: Red Hat OpenShift 4.14, 4.15
  • Kernel version: RH kernel versions based on 5.14
  • Device plugins version: v0.28 from OpenShift OperatorHub
  • Hardware info: 4th and 5th Gen Intel Xeon Processors with QAT

Additional context
It might be an edge case similar to the one noticed for NVIDIA GPUs (this specific comment): NVIDIA/k8s-device-plugin#291 (comment).
As in the linked issue, the Pod somehow gets started before sgx.intel.com/epc can be successfully allocated and mounted. Very curious why that happens.

mythi (Contributor) commented Mar 22, 2024

@Feelas can you check the pod's resources when the error happens?

I believe you only get

sgx.intel.com/epc: 1Gi

and not

sgx.intel.com/epc: 1Gi
sgx.intel.com/enclave: 1

which then suggests the operator has failed to mutate the Pod. You could check the API server logs for more details on why the operator/webhook failed.
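A rough sketch of both checks (the pod and node names are placeholders; "failed calling webhook" is a typical API server admission error string, adjust the grep to what your logs actually contain):

    # Show the effective resources of the failing Pod; after a successful
    # mutation both sgx.intel.com/epc and sgx.intel.com/enclave should appear.
    kubectl get pod <pod-name> -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'

    # Grep the API server logs for admission webhook failures (on OpenShift the
    # kube-apiserver pods live in the openshift-kube-apiserver namespace and are
    # named per node).
    kubectl -n openshift-kube-apiserver logs kube-apiserver-<node-name> | grep -i 'failed calling webhook'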

tkatila (Contributor) commented Mar 22, 2024

Not sure if it's related, but we've seen something somewhat similar with the e2e FPGA tests. It seems a Pod gets scheduled with the generic resource name (the Pod goes to Pending), and when the operator comes fully online it doesn't detect the Pending Pod, so the Pod is left in the Pending state.

Feelas (Author) commented Mar 25, 2024

@mythi we've managed to trigger the error again.

As a result:

  • sgx.intel.com/enclave was indeed missing from the resources section of the Pod, and thus Gramine failed.
  • The Pod was stuck in CrashLoopBackOff for a very long time.
  • No "admission: Mutated SGX Pod" messages for this specific Pod were present in the controller-manager Pod's logs.
  • After DELETE-ing the Pod, the "admission: Mutated SGX Pod" logs for our workload appear instantly and it starts correctly.
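For reference, the log check was roughly the following (the namespace and deployment names are placeholders for whatever the operator install uses in your cluster):

    # Placeholder names; substitute the actual controller-manager deployment
    # from the Intel Device Plugins Operator install.
    kubectl -n <operator-namespace> logs deploy/<controller-manager-deployment> | grep "Mutated SGX Pod"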

Given my limited knowledge of webhooks:

  • I took a quick look at the webhook triggers and it looks like it hooks on the CREATE operation of the Pod (webhooks.rules in the SGX MutatingWebhookConfiguration), roughly as sketched below.
  • Will a restarting Pod match a "pods CREATE" rule for a webhook? Maybe that is the reason for this behaviour.

If yes, it looks like the webhook might not handle "retroactively" mutating Pods w/ SGX requests that were created on the cluster before the webhook itself is created and registered, unless of course I am missing something about how the webhooks work.
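Roughly what the trigger looks like to me, paraphrased with placeholder names (not the exact manifest shipped by the operator):

    webhooks:
    - name: sgx.mutator.webhooks.intel.com    # placeholder name
      rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]                # only Pod creation is matched
        resources: ["pods"]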

mythi (Contributor) commented Mar 25, 2024

> If yes, it looks like the webhook might not handle "retroactively" mutating Pods w/ SGX requests that were created on the cluster before the webhook itself is created and registered, unless of course I am missing something about how the webhooks work.

I'd like to understand why the mutation gets missed on CREATE because that's what starts CrashLoopBackOff.

Feelas (Author) commented Mar 25, 2024

> > If yes, it looks like the webhook might not handle "retroactively" mutating Pods w/ SGX requests that were created on the cluster before the webhook itself is created and registered, unless of course I am missing something about how the webhooks work.
>
> I'd like to understand why the mutation gets missed on CREATE because that's what starts CrashLoopBackOff.

Same here; the only idea that would make sense is if retries for spawning the Pod don't trigger a CREATE.

So the timeline would look like this, in my mind:
The Pod gets created (causing a CREATE) while the webhook is still initializing -> the Pod is stuck restarting and does not emit further CREATE events -> the webhook gets initialized, registered and responds, but no CREATE events are passed from the failing Pod.

And at this point it would boil down to the question: "when can we be sure that the CREATE hook is already up?"
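One way to approximate "the CREATE hook is up" in an automated flow would be to wait for the webhook's backing deployment before applying the SGX workloads; a rough sketch with placeholder names (the actual namespace and deployment depend on the operator install):

    # Placeholder names; substitute the namespace and deployment that back the
    # SGX mutating webhook in your operator install.
    kubectl -n <operator-namespace> wait --for=condition=Available \
      deployment/<webhook-deployment> --timeout=120s

    # Only after the wait succeeds, apply/sync the SGX workloads.
    kubectl apply -f sgx-workload.yaml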
