Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle multiple concurrent Azure DNS01 challenges for the same FQDN #6351

Merged
merged 5 commits into from May 14, 2024

Conversation

eplightning
Copy link
Contributor

@eplightning eplightning commented Sep 18, 2023

Pull Request Motivation

This PR fixes a race condition in Azure DNS that happens when issuing certificates concurrently in different clusters as described in #6229.

Similar issues were already fixed for AWS/GCP/Cloudflare in other PRs:

To make this work in Azure DNS, this PR makes use of optimistic concurrency control supported by Azure DNS API via standard If-Match, If-None-Match, Etag headers: https://learn.microsoft.com/en-us/rest/api/dns/record-sets/create-or-update?tabs=HTTP

Current version simply unconditionally overwrites any changes. New version introduced in this PR does the following for all updates:

  1. Try to fetch an already existing matching record. If found, Azure DNS will return the record set and its version (Etag).

  2. Modify record set as necessary: Append if doesn't exist on Present, remove if exists on CleanUp

  3. Call appropriate Azure API:

  • If after modifications no records remain, call Delete API. Etag will ensure we will only delete record set if no modifications happened concurrently. Call is skipped if step 1 didn't find a matching record set.

  • If there are records and a matching record set was not found: call CreateOrUpdate API with If-None-Match: *. This will ensure we only create and not update (otherwise we would overwrite other changes).

  • If there are records and we found a matching record set: call CreateOrUpdate API with If-Match: $Etag. Call will only succeed if no other updates happened concurrently.

  1. If any precondition fail, Azure DNS will return an error which will be eventually retried by cert-manager.

Kind

/kind bug

Release Note

Fix ACME issuer being stuck waiting for DNS propagation when using Azure DNS with multiple instances issuing for the same FQDN

@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 22, 2023
@irbekrm
Copy link
Collaborator

irbekrm commented Sep 26, 2023

/ok-to-test

@eplightning
Copy link
Contributor Author

/retest

1 similar comment
@eplightning
Copy link
Contributor Author

/retest

@jetstack-bot jetstack-bot added the dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. label Nov 13, 2023
@jetstack-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jakexks for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added area/acme Indicates a PR directly modifies the ACME Issuer code area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code labels Nov 13, 2023
@eplightning
Copy link
Contributor Author

Alright rebasing seems to have fixed the bot and tests not passing

@jetstack-bot jetstack-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2024
@jetstack-bot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jetstack-bot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2024
@eplightning
Copy link
Contributor Author

/remove-lifecycle stale

I'll rebase this eventually ...

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2024
@oliverplann
Copy link

@eplightning @irbekrm @maelvls Sorry for tagging you, but is there any update on this whatsoever? The PR was created quite a long time ago and similar fixes have been merged across other DNS providers.

There has been quite a lot of mentions on this e.g.:

I can confirm this also happens on a single cluster attempting to issue either multiple certificates or multiple dnsNames using the same less-priviliged domain in Azure.

I'm not a GO expert, but I have Azure infrastructure available to help you test, if you instruct me how. Currently I'm installing cert-manager through the official helm chart, but I imagine I'll have to do something else in this case.

Furthermore, is there any known workaround as long as this issue persists?

// Oliver

Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@cert-manager-prow cert-manager-prow bot added dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. and removed dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 10, 2024
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. and removed dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. labels May 10, 2024
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@eplightning
Copy link
Contributor Author

eplightning commented May 10, 2024

@oliverplann I rebased the PR, so whatever happens next is up to this project's maintainers.

I'm not aware of any workarounds, whenever this issue happens either a manual DNS update is required or the certificate issuance needs to be somehow attempted again - probably by removing CertificateRequest? (I don't really remember at this point honestly).

@cert-manager-prow cert-manager-prow bot added dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. and removed dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels May 10, 2024
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. and removed dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. labels May 10, 2024
@eplightning
Copy link
Contributor Author

/retest

@oliverplann
Copy link

@oliverplann I rebased the PR, so whatever happens next is up to this project's maintainers.

I'm not aware of any workarounds, whenever this issue happens either a manual DNS update is required or the certificate issuance needs to be somehow attempted again - probably by removing CertificateRequest? (I don't really remember at this point honestly).

Thank you for acting quick. I hope the maintainers will prioritize this...

@oliverplann
Copy link

oliverplann commented May 13, 2024

TLDR
I'll test this PR and have also found a suitable workaround:
Using Helm: Set the value maxConcurrentChallenges: 1 (Default: 60)
Using kubectl apply: Edit deployment.yaml and set --max-concurrent-challenges=1

Original
I've reached out to the maintainers on the Kubernetes slack and they've instructed me how to test this PR, which I will get done today or tomorrow.

Similarly to a multi-cluster setup, if you are using a delegation domain (source) to issue multiple certificates or dnsNames e.g. _acme-challenge.less-privileged.example.org cert-manager will reuse the same TXT record for every challenge in the cluster. This currently results in an edge case where each challenge will randomly overwrite the value of the TXT record, before the previous challenge succeeded. This in turn puts most, if not all of the challenges in a pending state.

As long as this bug exists, it's facilitated through the default value of maxConcurrentChallenges.

I've found a suitable workaround until this PR has been adressed. Sharing it here in case anyone else needs a working cert-manager without manually interacting with cert-manager resources or Azure DNS:

Using Helm: Set the value maxConcurrentChallenges: 1 (Default: 60)
Using kubectl apply: Edit deployment.yaml and set --max-concurrent-challenges=1

Keep in mind that It will take significantly longer for cert-manager to validate challenges and issue certificates this way, depending on the amount of certs and dnsNames in your environment, but it works!
Also this shouldn't create issues renewing existing certs, as cert-manager will wait until every challenge for a given certificate is Valid before it changes the Ready state of the certificate.

Here's an example of what challenges will look like with maxConcurrentChallenges set to 1, handling 1 at a time:

example-com-certificate-1-2696367434-4084983421             example.com         4m55s
example-org-certificate-1-2696367434-410157576              k8s.example.org     4m55s
example-org-certificate-1-2696367434-413199771              tst.k8s.example.org 4m55s
example-net-certificate-1-1612508506-2239642643   pending   k8s.example.net     4m56s
example-net-certificate-1-1612508506-2283501741   valid     example.net         4m56s
example-net-certificate-1-1612508506-4097195913   valid     example.net         4m56s

@oliverplann
Copy link

Hello @eplightning

I've tested your PR using AKS & Azure DNS and it works perfectly! Thank you very much for your contribution.
I reached out to @SgtCoDFish on Slack and he'll look into merging this when he has time.

My findings below.

Challenges are now adding separate entries/values to the _acme-challenge TXT record in Azure DNS, as opposed to overwriting the value each time:

txt-record

kubectl get challenges shows that multiple challenges are being evaluated concurrently and their state are steadily changing to Valid:

NAME                                               STATE     DOMAIN
example-com-certificate-1-196084347-1516893050     pending   tst.k8s.example.com                     
example-com-certificate-1-196084347-2394518141     valid     k8s.example.com          
example-com-certificate-1-196084347-3341303458               example.com                       
example-org-certificate-1-196084347-3621474682     pending   tst.k8s.example.org
example-org-certificate-1-196084347-4044510119     pending   k8s.example.org
example-org-certificate-1-196084347-4127470729     valid     example.org            
example-net-certificate-1-2696367434-2908286604    pending   tst.k8s.example.net       
example-net-certificate-1-2696367434-3121967868    valid     k8s.example.net      
example-net-certificate-1-2696367434-3152871262    valid     example.net

Secrets & certificates are created:

NAME                          TYPE                 DATA   AGE
example-com-tls-secret        kubernetes.io/tls    2      93m
example-net-tls-secret        kubernetes.io/tls    2      93m
example-org-tls-secret        kubernetes.io/tls    2      93m

NAME                          READY   SECRET                    AGE
example-com-certificate       True    example-com-tls-secret    93m
example-net-certificate       True    example-net-tls-secret    93m
example-org-certificate       True    example-org-tls-secret    93m

Please let me know if anything is missing before we can merge this PR :-)

// Oliver

@SgtCoDFish SgtCoDFish added this to the 1.15 milestone May 14, 2024
Copy link
Member

@SgtCoDFish SgtCoDFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

We're not really able to easily test Azure changes so big thanks to @oliverplann for testing this! And of course, thank you to everyone else involved for your patience and your efforts here. This should make it into 1.15 for sure.

@cert-manager-prow cert-manager-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2024
@cert-manager-prow
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SgtCoDFish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2024
@inteon inteon modified the milestones: 1.15, 1.16 May 14, 2024
@cert-manager-prow cert-manager-prow bot merged commit 7db560c into cert-manager:master May 14, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code area/acme Indicates a PR directly modifies the ACME Issuer code dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants