Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement RFC8555 (ACME spec) § 8.2 #168

Open
dcow opened this issue Jan 29, 2020 · 10 comments · May be fixed by #242
Open

Implement RFC8555 (ACME spec) § 8.2 #168

dcow opened this issue Jan 29, 2020 · 10 comments · May be fixed by #242
Assignees

Comments

@dcow
Copy link
Contributor

dcow commented Jan 29, 2020

Section 8.2 of the ACME spec details exactly how client and server retry should be handled during a challenge validation. We should implement this part of the spec. Namely, retry state needs to be included in the challenge resource, client requests should be throttled, and we should set appropriate retry-after headers on the challenge resource response and include error information about the result of each attempted validation.

More details:
#162 (comment)

@sourishkrout sourishkrout changed the title Implement RFC8555 § 8.2 Implement RFC8555 (ACME spec) § 8.2 Jan 29, 2020
@wesgraham
Copy link

wesgraham commented Feb 4, 2020

Hey all :)
Quick question on the desired approach:
I'm about to modify both validate() calls (on http01 and dns01 challenges) in acme/challenge.go to account for retries. The simple approach I'm thinking of would be to make a blocking call to time.sleep() in between attempts. I recognize this could also be made non-blocking (using timer tasks/channels) - however it would likely require additional changes to the module. Any thoughts here?

@dcow
Copy link
Contributor Author

dcow commented Feb 4, 2020

We need to be able to cancel future retries if a client request to validate comes in while we'd otherwise be sleeping. As far as I can tell, that pushes things toward the timer approach so we can cancel pending retries and start over when new requests come in. If that's overly complicated, I guess sleeping routines could check whether they've been canceled when they wake up or something, but after 3 minutes of thought, I think the timer approach is still preferable. cc @maraino or @dopey for a second opinion.

@wesgraham
Copy link

Makes sense - in that case I'll utilize the timer approach. Will also modify ValidateChallenge() in acme/authority.go to actively read retry messages generated by the timer task. Will submit a PR by end of week.

@dcow
Copy link
Contributor Author

dcow commented Apr 30, 2020

I'm picking this up in #242.

@tomasfreund
Copy link

Hi, any updates on this? Some of our ACME challenges occasionally get stuck in invalid state because of (probably) transient DNS errors or propagation delays. If this was implemented, it would probably fix a lot of our problems. Thanks

@dopey dopey added the needs triage Waiting for discussion / prioritization by team label Sep 7, 2021
@dopey
Copy link
Contributor

dopey commented Sep 7, 2021

Hey @tomasfreund we ended up tabling this without merging. Then the implementer moved on to a different project, and we've pretty much let it sit since then because there hasn't been much interest from the community.

To my recollection, my main issue with this PR (in the state that we left it) was that it was "complicated". It required state to be saved between executions of step-ca. The way it was being stored, and then refreshed on start up, wound up being leaked into the ca.json, which I wasn't happy with. Basically, the PR needs more work before we could merge it.

I'll bring this up at our weekly triage meeting, but my guess is that we don't have bandwidth to prioritize this issue at this time, given that the workaround is probably just to retry the request.

@dcow
Copy link
Contributor Author

dcow commented Sep 8, 2021

If you're okay with this only being guaranteed to work in non "HA" contexts (single instance of step-ca) then you should be able to take out the ordinal (implication being that any updates to the retry state would need to be transactional and handled by the backing store and the transaction would need to span the length of the validation attempt) or just leave it in and pin it to 0 in the code and hide it from the config json for the time being.

Beyond that, I believe the remaining work is just one more pass to address outstanding comments on the PR (there are some minor but blocking logic updates needed) and then a pass on the tests to be more comprehensive. I believe I updated the existing tests which cover the changes made in the PR and they generally pass. However, the PR also involves converting some cases where the server would have returned a 4xx into a 2xx + error. I discovered there was no existing coverage for those specific scenarios and recall we wanted to add some coverage as well as run a smoke test using existing clients to make sure they properly handle the updated behavior.

That's about all I remember after a quick skim of #242.

@dopey
Copy link
Contributor

dopey commented Sep 8, 2021

Agree w/ everything @dcow said above.

I think we weren't sure at the time, and still aren't, if we wanted folks trying to use an HA setup to need extra configuration. We're no further along in answering that question.

@dopey
Copy link
Contributor

dopey commented Sep 15, 2021

Hey @tomasfreund we had the opportunity to discuss this issue a bit this morning and came away with a few questions / comments.

  1. There are some ACME clients that will wait to finalize (or request verification) until they've first verified that the DNS probe is successful. Which ACME client are you using, and is it possible that switching clients would be an easy workaround?

  2. We're trying to gauge how many users are affected by similar issues. For those affected, please drop a thumbs up on the issue (or this comment) or leave a comment. I know it's silly, but it helps us with prioritization. Thanks!

@tomasfreund
Copy link

Hi, we are using cert-manager which tries to verify the DNS but our dns-01 challenges still get stuck sometimes.
We get There was a problem with a DNS query during identifier validation in the cert-manager challenge and "error":{"type":"urn:ietf:params:acme:error:dns","detail":"There was a problem with a DNS query during identifier validation"} on the CA side.
We have found that creating a new challenge or restarting cert-manager helps most of the time.
I suspect that this might have something to do with DNS propagation - even if cert-manager checks the existence of the DNS record with a lookup, it does not guarantee that is has already propagated to the DNS server the CA is using. If there is no retry there, I can understand why it would fail.
This is a really big issue for us because having to monitor and manually retry challenges defeats the purpose of automatic certificate management.

@dopey dopey removed the needs triage Waiting for discussion / prioritization by team label Oct 13, 2021
@maraino maraino removed this from the v0.15.0 milestone Jul 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants