Question: problems setting up Marblerun Quickstart SGX on local machine #327

Open
vdae opened this issue Dec 10, 2022 · 6 comments
vdae commented Dec 10, 2022

Hello edgelesssys team,

I'm having problems trying to set up SGX-enabled MarbleRun in a local minikube cluster to run the emojivoto example.

My system runs Ubuntu 20.04 LTS and supports SGX and FLC (output from Gramine's is-sgx-available tool):

SGX supported by CPU: true
SGX1 (ECREATE, EENTER, ...): true
SGX2 (EAUG, EACCEPT, EMODPR, ...): true
Flexible Launch Control (IA32_SGXPUBKEYHASH{0..3} MSRs): true
SGX extensions for virtualizers (EINCVIRTCHILD, EDECVIRTCHILD, ESETCONTEXT): false
Extensions for concurrent memory management (ETRACKC, ELDBC, ELDUC, ERDINFO): false
CET enclave attributes support (See Table 37-5 in the SDM): false
Key separation and sharing (KSS) support (CONFIGID, CONFIGSVN, ISVEXTPRODID, ISVFAMILYID report fields): false
Max enclave size (32-bit): 0x80000000
Max enclave size (64-bit): 0x1000000000
EPC size: 0x5e00000
SGX driver loaded: true
AESMD installed: true
SGX PSW/libsgx installed: true

Local Intel PCCS and AESMD services (sgx-aesm-service and sgx-dcap-pccs from the Intel SGX apt repository) are running, and all other libsgx-* packages from apt are installed. USE_SECURE_CERT=FALSE is set in /etc/sgx_default_qcnl.conf. The system is on the latest BIOS, so the TCB should not be a problem.
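Roughly, the relevant part of my /etc/sgx_default_qcnl.conf looks like this (a sketch of the legacy key/value format; newer DCAP releases use an equivalent JSON format, and the URL shown is the local PCCS default):

```ini
# PCCS endpoint the quote provider library contacts for collateral
PCCS_URL=https://localhost:8081/sgx/certification/v3/
# Accept the PCCS's self-signed TLS certificate
USE_SECURE_CERT=FALSE
```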

The minikube cluster is set up like the one from your tf-training repo:

minikube start --mount --mount-string /var/run/aesmd/:/var/run/aesmd --memory 24576

minikube ssh
sudo mkdir /dev/sgx
sudo ln -s /dev/sgx_enclave /dev/sgx/enclave
sudo ln -s /dev/sgx_provision /dev/sgx/provision

minikube kubectl -- apply -f https://github.com/jetstack/cert-manager/releases/download/v1.3.3/cert-manager.yaml
minikube kubectl -- apply -k https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/sgx_plugin/overlays/epc-nfd/?ref=v0.23.0

Checking with kubectl shows that the SGX device plugin is running, and marblerun precheck confirms it:

user@nuc:~$ minikube kubectl -- describe node | grep sgx.intel.com
                    nfd.node.kubernetes.io/extended-resources: sgx.intel.com/epc
  sgx.intel.com/enclave:    110
  sgx.intel.com/epc:        98566144
  sgx.intel.com/provision:  110
  sgx.intel.com/enclave:    110
  sgx.intel.com/epc:        98566144
  sgx.intel.com/provision:  110
  sgx.intel.com/enclave    1           1
  sgx.intel.com/epc        10Mi        10Mi
  sgx.intel.com/provision  1           1
  
user@nuc:~$ marblerun precheck
  Cluster supports SGX on 1 node
  To install MarbleRun run [marblerun install]

I tried starting MarbleRun both with the --dcap-qpl intel flag and without it:

When running marblerun install --dcap-secure-cert FALSE --dcap-qpl intel, the Coordinator fails when invoking main:

[meshentry] invoking premain
[meshentry] invoking main
{"level":"info","ts":1670679324.0814278,"caller":"coordinator/run.go:53","msg":"starting coordinator","version":"0.6.1","commit":"2233c5e5892faf16d63f26e6439a1cfc15e8cd81"}
{"level":"info","ts":1670679324.0814278,"caller":"coordinator/run.go:84","msg":"creating the Core object"}
{"level":"info","ts":1670679324.0814278,"caller":"core/core.go:137","msg":"loading state"}
{"level":"info","ts":1670679324.085428,"caller":"core/core.go:175","msg":"No sealed state found. Proceeding with new state."}
{"level":"info","ts":1670679324.0934281,"caller":"core/core.go:331","msg":"generating quote"}
[get_platform_quote_cert_data ../qe_logic.cpp:378] Error returned from the p_sgx_get_quote_config API. 0xe019
{"level":"fatal","ts":1670679324.169431,"caller":"coordinator/run.go:91","msg":"Cannot create Coordinator core","error":"failed to get quote: OE_PLATFORM_ERROR","stacktrace":"main.run\n\tgithub.com/edgelesssys/marblerun/cmd/coordinator/run.go:91\nmain.main\n\tgithub.com/edgelesssys/marblerun/cmd/coordinator/enclavemain.go:29\nmain.invokemain\n\tgithub.com/edgelesssys/marblerun/cmd/coordinator/invokemain.go:15\n_cgoexp_2210a7c57b2b_invokemain\n\t_cgo_gotypes.go:42\nruntime.cgocallbackg1\n\truntime/cgocall.go:314\nruntime.cgocallbackg\n\truntime/cgocall.go:233\nruntime.cgocallback\n\truntime/asm_amd64.s:971"}
[erthost] loading enclave ...
[erthost] entering enclave ...
ERROR: dcap_quoteprov: [ERROR]: [QPL] Failed to get quote config. Error code is 0xb006
ERROR: quote3_error_t=SGX_QL_NETWORK_ERROR
 (oe_result_t=OE_PLATFORM_ERROR) [openenclave-src/host/sgx/sgxquote.c:oe_sgx_qe_get_target_info:706]
ERROR: SGX Plugin _get_report(): failed to get ecdsa report. OE_PLATFORM_ERROR (oe_result_t=OE_PLATFORM_ERROR) [openenclave-src/enclave/sgx/attester.c:_get_report:324]

When running marblerun install --dcap-secure-cert FALSE (so it should be using the Azure QPL), the Coordinator starts, but the Marbles fail:

Coordinator log (when a Marble tries to register):

{"level":"info","ts":1670681725.410313,"caller":"core/marbleapi.go:55","msg":"Received activation request","MarbleType":"web"}
{"level":"info","ts":1670681725.6423116,"caller":"zap/options.go:212","msg":"finished unary call with code Unauthenticated","grpc.start_time":"2022-12-10T14:15:25Z","system":"grpc","span.kind":"server","grpc.service":"rpc.Marble","grpc.method":"Activate","peer.address":"172.17.0.1:41097","error":"rpc error: code = Unauthenticated desc = invalid quote","grpc.code":"Unauthenticated","grpc.time_ms":235.998}
{"level":"info","ts":1670681725.6743114,"caller":"zap/grpclogger.go:92","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}
ERROR: dcap_quoteprov: [ERROR]: HTTP error (404)
ERROR: dcap_quoteprov: [ERROR]: Encountered CURL error 22 in curl_easy_perform
ERROR: dcap_quoteprov: [ERROR]: curl error thrown, error code: 16: curl_easy_perform
ERROR: dcap_quoteprov: [ERROR]: Error fetching TCB Info: 57371
ERROR: Failed to get certificate quote verification collateral information. OE_QUOTE_PROVIDER_CALL_ERROR (oe_result_t=OE_QUOTE_PROVIDER_CALL_ERROR) [openenclave-src/common/sgx/endorsements.c:oe_get_sgx_endorsements:405]

Marble log:

EGo v0.3.2 (7aa02feec03da36f984a335ddd58c85cac5cedaa)
[erthost] loading enclave ...
[erthost] entering enclave ...
[PreMain] 2022/12/10 14:09:21 starting PreMain
[PreMain] 2022/12/10 14:09:21 fetching env variables
[PreMain] 2022/12/10 14:09:21 loading TLS Credentials
[PreMain] 2022/12/10 14:09:21 loading UUID
[PreMain] 2022/12/10 14:09:21 found UUID: 92ae931e-d476-4784-adf4-df4f8bffcd31
[PreMain] 2022/12/10 14:09:21 generating CSR
[PreMain] 2022/12/10 14:09:21 generating quote
Azure Quote Provider: libdcap_quoteprov.so [ERROR]: Could not retrieve environment variable for 'AZDCAP_DEBUG_LOG_LEVEL'
[PreMain] 2022/12/10 14:09:22 activating marble of type web
panic: rpc error: code = Unauthenticated desc = invalid quote
goroutine 17 [running, locked to thread]:
main.ert_ego_premain(0x7fbbc17bf368, 0x7fbbc17bf360, 0x41, 0x7fbbc2d88430, 0x7fbbc2d88390)
 ego/premain/main.go:31 +0x27b

To check the QPL I tried running Intel's QuoteGenerationSample, and it works, but it runs on the host OS and not in the minikube cluster.

sgx_qe_set_enclave_load_policy is valid in in-proc mode only and it is optional: the default enclave load policy is persistent: 
set the enclave load policy as persistent:succeed!

Step1: Call sgx_qe_get_target_info:succeed!
Step2: Call create_app_report:succeed!
Step3: Call sgx_qe_get_quote_size:succeed!
Step4: Call sgx_qe_get_quote:succeed!cert_key_type = 0x5
sgx_qe_cleanup_by_policy is valid in in-proc mode only.

 Clean up the enclave load policy:succeed!

The EGo remote attestation sample works locally, too (using the EGo snap).

Do you have any ideas what could cause these problems?
Looking forward to your reply.

@daniel-weisse (Member) commented:

Hey there

When running with the Intel QPL, you will probably also need to set the --dcap-pccs-url flag to point to your PCCS.
By default it is set to https://localhost:8081/sgx/certification/v3/, which is most likely not reachable from inside your Kubernetes pods.

You can try running the PCCS as a pod in minikube and configure the Coordinator to use that pod as its PCCS.
There is also a host access option for minikube that looks like it could be used to let the Coordinator pod reach a PCCS running on the host system.

The reason the Coordinator is able to start with the Azure QPL is that the Azure QPL can still generate quotes. However, you will not be able to verify any quotes as long as you are not running inside a VM provided by Azure.
This is why you see an error when the Marbles try to join.
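As a sketch, the install invocation could then look something like this (the host is a placeholder; substitute whatever address your PCCS ends up being reachable at from inside the pods, e.g. a Service name or host.minikube.internal):

```shell
# Sketch: point the Coordinator's Intel QPL at a PCCS reachable from the pods.
# <pccs-host> is a placeholder, not a real hostname.
marblerun install \
  --dcap-qpl intel \
  --dcap-secure-cert FALSE \
  --dcap-pccs-url https://<pccs-host>:8081/sgx/certification/v3/
```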

vdae commented Dec 12, 2022

Hi @daniel-weisse,

thank you for your reply.

You were correct, the PCCS was not reachable from inside the minikube cluster.

I reinstalled the PCCS and used this setting during installation:

Set the PCCS service to accept local connections only? [Y] (Y/N) :N

After that the PCCS was reachable using https://host.minikube.internal:8081/sgx/certification/v3/ as the URL.
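A quick way to check reachability from inside the cluster is a throwaway curl pod. This is only a sketch: it assumes the curlimages/curl image and the PCCS rootcacrl endpoint, and -k is needed because of the PCCS's self-signed certificate.

```shell
# Hypothetical check: query the PCCS root CA CRL endpoint from a short-lived pod.
kubectl run pccs-check --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sk https://host.minikube.internal:8081/sgx/certification/v3/rootcacrl
```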

Now the Coordinator starts when using the Intel QPL (marblerun install --dcap-secure-cert FALSE --dcap-qpl intel --dcap-pccs-url https://host.minikube.internal:8081/sgx/certification/v3/), but unfortunately the Marbles still fail to register.

The error looks similar to what happens when the Intel QPL is not set during the MarbleRun installation.

Also, after installing emojivoto via helm, the Coordinator starts crashing and restarting.

The following snippets are the logs of the Coordinator and a Marble.

Coordinator log:

{"level":"info","ts":1670837192.2871456,"caller":"core/marbleapi.go:55","msg":"Received activation request","MarbleType":"voting-svc"}
{"level":"info","ts":1670837195.4311583,"caller":"zap/options.go:212","msg":"finished unary call with code Unauthenticated","grpc.start_time":"2022-12-12T09:26:32Z","system":"grpc","span.kind":"server","grpc.service":"rpc.Marble","grpc.method":"Activate","peer.address":"172.17.0.1:22733","error":"rpc error: code = Unauthenticated desc = invalid quote","grpc.code":"Unauthenticated","grpc.time_ms":3144.012}
{"level":"info","ts":1670837195.6631591,"caller":"zap/grpclogger.go:92","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}
2022/12/12 09:26:39 http: TLS handshake error from 172.17.0.1:60158: EOF
2022/12/12 09:26:39 http: TLS handshake error from 172.17.0.1:60164: EOF

Marble log:

EGo v0.3.2 (7aa02feec03da36f984a335ddd58c85cac5cedaa)
[erthost] loading enclave ...
[erthost] entering enclave ...
[PreMain] 2022/12/12 09:27:54 starting PreMain
[PreMain] 2022/12/12 09:27:54 fetching env variables
[PreMain] 2022/12/12 09:27:54 loading TLS Credentials
[PreMain] 2022/12/12 09:27:54 loading UUID
[PreMain] 2022/12/12 09:27:54 found UUID: e9a86b30-c9e2-4e10-bd4c-b5550f842511
[PreMain] 2022/12/12 09:27:54 generating CSR
[PreMain] 2022/12/12 09:27:54 generating quote
Azure Quote Provider: libdcap_quoteprov.so [ERROR]: Could not retrieve environment variable for 'AZDCAP_DEBUG_LOG_LEVEL'
[PreMain] 2022/12/12 09:27:55 activating marble of type emoji-svc
panic: rpc error: code = Unauthenticated desc = invalid quote
goroutine 17 [running, locked to thread]:
main.ert_ego_premain(0x7f04c17bf368, 0x7f04c17bf360, 0x41, 0x7f04c2e59430, 0x7f04c2e59390)
 ego/premain/main.go:31 +0x27b

Do I have to configure the Marbles to use the Intel QPL when installing the emojivoto example? I have already tried two things:

The emojivoto manifest contains an Infrastructures block, which the docs only mention for future releases. I tried removing it, but that did not change anything.

"Infrastructures": {
  "Azure": {}
},

I also tried setting the DCAP library in the Marble helm charts:

- name: DCAP_LIBRARY
  valueFrom:
    configMapKeyRef:
      name: oe-config
      key: DCAP_LIBRARY

In sgx-values.yaml (since oe-config seems to render this block: {{- toYaml .Values.simulation | nindent 2 }}):

simulation:
  OE_SIMULATION: "0"
  DCAP_LIBRARY: intel

But this didn't change anything, either.

@daniel-weisse (Member) commented:

Now the Coordinator starts when using the intel qpl, but unfortunately marbles still fail to register.

I think this happens because the emojivoto container images were not built with the Intel QPL.
I will test this and upload new images.

Also, after installing emojivoto using helm, the coordinator starts crashing and restarting.

Can you check why this is happening? It could be due to resource exhaustion on the host.
Does kubectl describe pods -n marblerun marblerun-coordinator or kubectl logs -n marblerun deployments/marblerun-coordinator give any clues to what is happening?

vdae commented Dec 12, 2022

I think this happens because the emojivoto container images were not built with the Intel QPL.
I will test this and upload new images.

Thank you very much.

Can you check why this is happening? Could be due to resource exhaustion on the host.
Does kubectl describe pods -n marblerun marblerun-coordinator or kubectl logs -n marblerun deployments/marblerun-coordinator give any clues to what is happening?

Resource exhaustion seems to be a plausible cause. The Coordinator starts crashing while the cluster constantly tries to restart the three emojivoto pods, and it stops crashing once the CrashLoopBackOff intervals of the emojivoto pods grow larger.

I have attached the full logs of a crashed Coordinator instance and the full pod description:

minikube kubectl -- describe pods -n marblerun marblerun-coordinator:
kubectl_describe_coordinator.txt

minikube kubectl -- logs -n marblerun deployments/marblerun-coordinator --previous:
coordinator_crashlog.txt

Edit: It seems to crash almost every time emojivoto starts; during those restarts, CPU utilization is at 100%. I will keep this running for a while and check again later.

NAMESPACE                NAME                                       READY   STATUS             RESTARTS       AGE
emojivoto                emoji-0                                    0/1     CrashLoopBackOff   10 (82s ago)   36m
emojivoto                vote-bot-84d5989b6d-hppgh                  1/1     Running            0              36m
emojivoto                voting-0                                   0/1     CrashLoopBackOff   10 (80s ago)   36m
emojivoto                web-0                                      0/1     CrashLoopBackOff   10 (80s ago)   36m
marblerun                marble-injector-868986bcc4-hv8f4           1/1     Running            0              41m
marblerun                marblerun-coordinator-6f6dfb6866-8n6dg     1/1     Running            9 (78s ago)    41m

Edit 2: I kept the cluster running; the Coordinator does not crash every time emojivoto restarts. But since CPU utilization hits 100% whenever emojivoto restarts and the system I'm testing on has a fairly weak CPU (Intel NUC), resource exhaustion is probably the most plausible cause.

emojivoto                emoji-0                                    0/1     CrashLoopBackOff   173 (5m16s ago)   16h
emojivoto                vote-bot-84d5989b6d-hppgh                  1/1     Running            0                 16h
emojivoto                voting-0                                   0/1     CrashLoopBackOff   174 (4m58s ago)   16h
emojivoto                web-0                                      0/1     CrashLoopBackOff   173 (4m51s ago)   16h
marblerun                marble-injector-868986bcc4-hv8f4           1/1     Running            0                 16h
marblerun                marblerun-coordinator-6f6dfb6866-8n6dg     1/1     Running            113 (15m ago)     16h

daniel-weisse commented Dec 13, 2022

I have uploaded new container images for emojivoto that were successfully verified by a Coordinator configured to use the Intel QPL.
You can find the changes on the feat/intel-qpl branch in the emojivoto repo.
I have updated the repo readme as well.

As for the crashes, I was not able to find anything in the logs. Maybe a kubectl describe nodes minikube shows something helpful?

For reference this is how I started my cluster:

# Start minikube with 4 cpus, 8GB memory, and create the /dev/sgx sym links
# emojivoto and MarbleRun do not need the /var/run/aesmd socket, so it can be skipped
minikube start --cpus=4 --memory=8G --mount --mount-string=/dev/sgx:/dev/sgx

# Install cert-manager v1.10.1
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.10.1/cert-manager.yaml

# Install sgx webhook
kubectl apply -k https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/sgx_plugin/overlays/epc-nfd/?ref=v0.23.0

vdae commented Dec 13, 2022

I have uploaded new container images for emojivoto that were successfully verified by a Coordinator configured to use the Intel QPL.
You can find the changes on the feat/intel-qpl branch in the emojivoto repo.
I have updated the repo readme as well.

Thank you for your help! With the changes made on the intel-qpl branch, the Marbles are registering.

As for the crashes, I was not able to find anything in the logs. Maybe a kubectl describe nodes minikube shows something helpful?

Unfortunately, the node description does not provide any helpful information about why the Coordinator was crashing.

The Coordinator still crashes while emojivoto is starting, but it is able to reload its state after each restart. Once all Marbles are registered, everything works and the Coordinator no longer restarts. As before, the Coordinator logs don't really explain why this is happening.

For the sake of completeness, I have attached the Coordinator's full log.
coordinator_crash_register.txt
