Renew device identity certificate automatically with EST #6911

Open
0xarmingithub opened this issue Feb 14, 2023 · 9 comments
Comments

@0xarmingithub

On our production edge devices, we use EST for automatic device identity certificate issuance and renewal. My questions mainly concern the following settings, for which I couldn't find answers in the documentation.

[provisioning.attestation.identity_cert.auto_renew]
rotate_key = true
threshold = "80%"
retry = "4%"
  • What are the pros and cons of rotating the private key? Our edge devices are out in the field and we won't have any remote access to them, so it is of the utmost importance to lower the risk of not being able to renew the certificate.
  • When we set retry = "4%", does it mean that if the renewal fails at 80%, it will try again at 84%, 88%, and so on, up to 100%? What happens after 100%?

Besides these questions, we have been thinking about a recovery plan for re-provisioning the edge device when certificate renewal doesn't go as expected. Without a disaster recovery plan, a failed renewal means we lose the device with no way to recover it remotely. We are thinking of developing an on-device service that checks the certificate's expiry time every hour and compares it with the current time. If the cert has expired and no new cert has been issued, which means EST renewal didn't go as expected, the edge runtime should switch to symmetric key attestation immediately. This requires having the derived device key already on the device as a backup, which is doable. Does this process sound feasible? What are the best practices for implementing this scenario?

The overall aim is, of course, to mitigate the risk and have a proper automatic recovery plan.

@jlian
Member

jlian commented Feb 14, 2023

These are great questions. I'll take a first pass here but might need @gordonwang0 and @eustacea to keep me honest.

> What are the pros and cons of rotating private key? our edge devices are out in the field and we won't have any remote access to them. So it is of the utmost importance to lower the risk of not being able to renew the certificate.

Pros: it mitigates the risk of key compromise. Cons: changing the key can be considered a cert reissuance as opposed to a renewal.

> When we set retry = "4%", does it mean that if the renewal fails at 80%, it will try again at 84%, 88%, and so on? Up until 100%? what will happen after 100%?

The certificate renewal procedure is based on a timer that tracks the expiration time of each managed certificate.

On startup, each IoT Edge runtime service collects a list of its managed certificates and checks each certificate for expiry. Expired and close-to-expiry certificates are renewed immediately. The service then schedules certificate renewal based on the certificate’s expiration time. So, at startup:

image

The renewal timer will then fire based on the renewals scheduled at startup:

image

Because the timer may fail to renew a certificate due to network unavailability or other issues, IoT Edge runtime services also attempt to renew a certificate upon use:

image
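
To make the `threshold` / `retry` arithmetic from the question concrete, here is a minimal sketch. It assumes both values are percentages of the certificate's total lifetime (the reading the question proposes; this thread does not spell out the exact semantics), and the function names are made up for illustration. It does not model what happens after 100%; as described above, the runtime also attempts renewal at startup and on use.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def parse_percent(value: str) -> float:
    """Convert a string such as "80%" into the number 80.0."""
    if not value.endswith("%"):
        raise ValueError(f"expected a percentage like '80%', got {value!r}")
    return float(value[:-1])


def renewal_schedule(not_before: datetime, not_after: datetime,
                     threshold: str = "80%", retry: str = "4%") -> List[datetime]:
    """Return the first renewal time, then retry times, all before expiry.

    ASSUMPTION: threshold and retry are fractions of the full lifetime.
    """
    lifetime = not_after - not_before
    first = not_before + lifetime * parse_percent(threshold) / 100
    step = lifetime * parse_percent(retry) / 100
    if step <= timedelta(0):
        raise ValueError("retry must be a positive percentage")
    times = []
    t = first
    while t < not_after:
        times.append(t)
        t += step
    return times


if __name__ == "__main__":
    issued = datetime(2023, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=100)
    for t in renewal_schedule(issued, expires):
        print(t.isoformat())
```

Under that assumption, a 100-day certificate would see a first attempt on day 80 and retries on days 84, 88, 92, and 96.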

With an expired device ID cert and no EST connectivity, your modules (including edgeHub) can continue to send messages, and the runtime should keep trying to renew the cert as long as it's up. But if the IoT Edge runtime stops, I believe it might fail to restart, because the device ID cert might be required at startup...? @gordonwang0 can you confirm?

> Besides these questions, we have been thinking about having a recovery plan for edge re-provisioning when certificate renewal doesn't go as expected. If that's the case and there is no disaster recovery plan, we will lose the device and there will be no way to recover it remotely. We are thinking to develop a service on device to monitor the certificate's expire time every 1 hour and compare it with the current time. If the cert is expired and no new cert has been reissued, which means EST renewal didn't go as expected, then the edge runtime should switch to symmetry key immediately. This requires having the derived device key already on the device as backup, which is doable. Does this process sound feasible? What are the best practices to materialize this scenario?

I believe your concern stems from the worst-case scenario where certificate renewal fails so many times that the cert expires and the device becomes unrecoverable. If that's the case, then the answer depends on the answer to the earlier question.

Also, if the EST service is potentially this unreliable, maybe symmetric key provisioning is the better approach? I'm also not sure this plan, where you switch attestation method upon reprovisioning, is practical. AFAIK each DPS enrollment (group) can only use one method, so if you want to change it on the device side you'll have to use a different enrollment (group). That might cause all sorts of issues with losing data associated with the device identity on the IoT Hub side, such as twin and deployment data.

@0xarmingithub
Author

Hi @jlian. Thanks for your comprehensive explanation, I appreciate it. It's now much clearer how the renewal works.

Regarding the recovery plan, we already have two DPS enrollment groups, one set up with a CA and one with a symmetric key. We have already tested switching (re-provisioning) the same device from CA to symmetric key and vice versa, remotely, via Microsoft's OSConfig module. It works pretty well: the device maintains its identity and only the authentication fingerprint changes, so IoT Hub and the data flow work as expected. As for the EST server, we already have an agreement with a big certificate issuer, and renewal and reissuance work flawlessly. The recovery plan is only for the worst case: some rare situation in which the renewal doesn't work.
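
The hourly expiry check proposed earlier in the thread could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the `openssl` CLI is available on the device, the certificate path passed in is whatever your setup uses, and `on_expired` is a hypothetical callback standing in for the actual switch to the symmetric key enrollment, which is not shown here.

```python
import ssl
import subprocess
import time
from datetime import datetime, timezone
from typing import Callable


def parse_enddate(openssl_line: str) -> datetime:
    """Parse 'notAfter=Feb 14 12:00:00 2024 GMT' (openssl x509 -enddate)."""
    key, sep, value = openssl_line.strip().partition("=")
    if not sep or key != "notAfter":
        raise ValueError(f"unexpected openssl output: {openssl_line!r}")
    seconds = ssl.cert_time_to_seconds(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def read_not_after(cert_path: str) -> datetime:
    """Ask the openssl CLI for the certificate's expiry time."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    )
    return parse_enddate(result.stdout)


def needs_fallback(not_after: datetime, now: datetime) -> bool:
    """True once the certificate on disk has expired without being replaced."""
    return now >= not_after


def watch(cert_path: str, on_expired: Callable[[], None],
          interval_s: int = 3600) -> None:
    """Check the certificate once per interval; call on_expired when it lapses.

    on_expired is a HYPOTHETICAL hook for the re-provisioning step.
    """
    while True:
        if needs_fallback(read_not_after(cert_path), datetime.now(timezone.utc)):
            on_expired()
            return
        time.sleep(interval_s)
```

Keeping the decision in `needs_fallback` separate from the polling loop makes it easy to test the logic without a real certificate or a running timer.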

@jlian
Member

jlian commented Feb 15, 2023

Ok, then that seems like a good plan.

@gordonwang0 could you chime in to confirm what happens if the device ID cert is expired, IoT Edge restarts, and renewal attempt fails?

@gordonwang0
Contributor

In that case, the runtime will continue to restart and attempt to renew the certificate until it performs a successful renewal.

@0xarmingithub
Author

Thanks both for the explanations. Closing the thread.

@0xarmingithub
Author

@jlian @gordonwang0
Sorry for popping in with another question. Is there a metric, or something in edgeHub's logs, that we can monitor to identify an unsuccessful renewal? In other words, how can we get notified if the renewal fails? We mainly need this to implement our monitoring and triggering mechanism. Could you share a sneak peek of the error as it appears in edgeHub/edgeAgent, so that our monitoring can hook into it and update the device/module twin in IoT Hub?

0xarmingithub reopened this Feb 16, 2023
@jlian
Member

jlian commented Feb 16, 2023

> What is the metrics to monitor edgeHub's logs to identify if the renewal was not successful? I mean how can we get notified if the renewal was not successful? It's mainly needed to implement our monitoring and triggering mechanism. Do you have a sneak peak of error in edgeHub/edgeAgent so the metrics can hook into and update device/module twin in IoT hub?

Good question. Also tagging @micahl, do you know?

@micahl
Contributor

micahl commented Feb 16, 2023

We don't currently have a specific metric for cert renewal. If you have a system set up to regularly gather logs from the device and send them to a Log Analytics workspace (e.g. the ELMS cloud workflow), then you could devise a query plus a log-based alert.
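
As a rough illustration of the log-based approach, the sketch below filters collected log lines for renewal failures. The regular expression is a placeholder and NOT an actual IoT Edge log message; you would need to replace it with whatever text your devices really emit when a renewal fails. The function name is also made up for this example.

```python
import re
from typing import Iterable, List, Pattern

# HYPOTHETICAL pattern: a placeholder, not a real IoT Edge log string.
# Replace it with the text observed in your own device logs.
RENEWAL_FAILURE: Pattern[str] = re.compile(r"renew.*(fail|error)", re.IGNORECASE)


def find_renewal_failures(lines: Iterable[str],
                          pattern: Pattern[str] = RENEWAL_FAILURE) -> List[str]:
    """Return the log lines that match the failure pattern."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # Made-up sample lines, only to show the filter in action.
    sample = [
        "INFO service started",
        "WARN certificate renewal failed: timeout",
        "INFO certificate renewed",
    ]
    for hit in find_renewal_failures(sample):
        print(hit)
```

A non-empty result could then drive whatever alerting or twin update your monitoring uses.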

@github-actions

This issue is being marked as stale because it has been open for 30 days with no activity.
