ACME (RFC 8555) § 8.2 Challenge Retries #242

Open · wants to merge 37 commits into master

Conversation

dcow (Contributor) commented Apr 30, 2020

Taking over #181. Original issue #162.

@dcow changed the title from "8555 § 8.3 Challenge Retries" to "ACME (RFC 8555) § 8.2 Challenge Retries" on Apr 30, 2020
@dcow self-assigned this Apr 30, 2020
@dcow requested a review from dopey April 30, 2020 12:29
dcow (Contributor, Author) commented May 1, 2020

https://tools.ietf.org/html/rfc8555#section-7.1.6 illustrates the challenge lifecycle. Retries are mentioned:

> Note that within the "processing" state, the server may attempt to validate the challenge multiple times (see Section 8.2). Likewise, client requests for retries do not cause a state change. If validation is successful, the challenge moves to the "valid" state; if there is an error, the challenge moves to the "invalid" state.

I read this as "if the server successfully acquires a challenge result (e.g. http resource, dns record, tls cert) but verification fails (the data is not correct), then move to the "invalid" state". This type of "error" is different from a transient network error which should be retried.

So I'm going to sift through all the error logic and bucket things into one of those two definitions. And we'll only retry for transient errors.
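
As a rough illustration, a bucketing helper along these lines might work; the function name and the specific checks here are assumptions for the sketch, not code from this PR:

```go
package acme

import (
	"errors"
	"net"
)

// isTransient reports whether a validation error looks retryable
// (network trouble) rather than definitive (the served data is wrong).
// A sketch only; the real classification lives in the error logic
// being sifted through above.
func isTransient(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true // timeout: try again later
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && (dnsErr.IsTemporary || dnsErr.IsNotFound) {
		return true // DNS hiccup, or record not propagated yet
	}
	return false // e.g. wrong key authorization: move to "invalid"
}
```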

Also, once the state is "invalid", I don't think a challenge can be revived. This means the proposed implementation, where a challenge lives in the invalid state between retry attempts, is not correct. The challenge should remain processing until it's either valid or invalid.

edit: s/pending/processing ^

cpu commented May 1, 2020

> Also, once the state is "invalid", I don't think a challenge can be revived. This means the proposed implementation, where a challenge lives in the invalid state between retry attempts, is not correct. The challenge should remain pending until it's either valid or invalid.

@dcow I think you would want it to remain in a "processing" state instead of a "pending" state for failed attempts that support further retries. I think that's the meat of the erratum that Rob from Sectigo filed against RFC 8555 related to retries, eid5732: https://www.rfc-editor.org/errata/eid5732

(Edit: this part of the spec is a little bit rough. Despite my best efforts it landed in 8555 without anyone ever having implemented it at the time... 😓)

dcow (Contributor, Author) commented May 1, 2020

> (Edit: this part of the spec is a little bit rough. Despite my best efforts it landed in 8555 without anyone ever having implemented it at the time... 😓)

@cpu thanks for the color. The errata is super useful—figuring out when to consider a challenge invalid is exactly what I've been mulling over. Do you have any insight on:

> While the server is still trying, the status of the challenge remains "processing"; it is only marked "invalid" once the server has given up.

from section 8.2? Is the intention that orders are only allowed to be in the processing state for a short amount of time after the initial client request and the server needs to put the kibosh on future client retries by marking the challenge invalid? I'm inclined to let a challenge remain in a processing state indefinitely until an authorization expires or otherwise becomes invalid. I guess one could interpret that behavior as a server that never "gives up" even if it won't automatically retry again unless the client pokes it.

cpu commented May 2, 2020

> from section 8.2? Is the intention that orders are only allowed to be in the processing state for a short amount of time after the initial client request and the server needs to put the kibosh on future client retries by marking the challenge invalid?

@dcow I think an order should only enter into the processing state when the client POSTs the finalization URL to begin issuance. At that point all of the challenges need to be valid for the finalization to start. In practice I suspect most clients expect orders to not remain in the processing state for very long simply because Let's Encrypt's issuance is more or less synchronous with the finalization request (but that's a quirk of one implementation and not a protocol guarantee).

> I'm inclined to let a challenge remain in a processing state indefinitely until an authorization expires or otherwise becomes invalid. I guess one could interpret that behavior as a server that never "gives up" even if it won't automatically retry again unless the client pokes it.

I think that makes sense for the challenges 👍 I can't think of a reason that would be a problem.

Hope that helps!
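
Sketched in Go, the behavior converged on in this thread might read as follows (the Status type and nextStatus are illustrative assumptions, not this PR's code):

```go
package acme

// Status values assumed to mirror RFC 8555 § 7.1.6.
type Status string

const (
	StatusProcessing Status = "processing"
	StatusValid      Status = "valid"
	StatusInvalid    Status = "invalid"
)

// nextStatus captures the agreed rule: a challenge only leaves
// "processing" on a definitive outcome; transient failures keep it
// in "processing" for another retry.
func nextStatus(validationErr error, transient bool) Status {
	switch {
	case validationErr == nil:
		return StatusValid
	case transient:
		return StatusProcessing // server hasn't "given up" yet
	default:
		return StatusInvalid // definitive failure; cannot be revived
	}
}
```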

Section 8.2 of RFC 8555 explains how retries apply to the validation
process. However, much is left up to the implementer.

Add retries every 12 seconds for 2 minutes after a client requests a
validation. The challenge status remains "processing" indefinitely until
a distinct conclusion is reached. This allows a client to continually
re-request a validation by sending a POST-as-GET to the challenge
resource until the process fails or succeeds.

Challenges in the processing state include information about why a
validation did not complete in the error field. The server also includes
a Retry-After header to help clients and servers coordinate.
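
A sketch of that schedule and the Retry-After header; the constant and helper names are assumptions, not the PR's actual identifiers:

```go
package acme

import (
	"net/http"
	"strconv"
	"time"
)

// One attempt every 12 seconds for 2 minutes: 10 attempts per
// client-requested validation.
const (
	retryInterval = 12 * time.Second
	retryWindow   = 2 * time.Minute
	maxAttempts   = int(retryWindow / retryInterval) // 10
)

// setRetryAfter is a hypothetical helper that tells the client when
// the next poll of the challenge resource is worthwhile.
func setRetryAfter(w http.ResponseWriter, next time.Time) {
	secs := int(time.Until(next).Seconds())
	if secs < 1 {
		secs = 1 // always send a positive number of seconds
	}
	w.Header().Set("Retry-After", strconv.Itoa(secs))
}
```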

Retries are inherently stateful because they're part of the public API.
When running step-ca in a highly available setup with replicas, care
must be taken to maintain a persistent identifier for each instance
"slot". In Kubernetes, this implies a *StatefulSet*.
dcow (Contributor, Author) commented May 6, 2020

Remaining TODO:

  • Reschedule retries when the ca starts up (from a crash or otherwise).
  • Test & Tests

@dcow marked this pull request as ready for review May 6, 2020 15:11
dcow (Contributor, Author) commented May 7, 2020

Re: rescheduling when we come back up. Discussed with @dopey and @maraino this morning. Given our current limitation to a k/v db, we decided to split this work out into a separate ticket. We may want to wait to address it until we have better support for crafting the type of query we want: select all the challenges where owner = my_ordinal && status == processing && num_retries < max_retries.
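
Until then, a startup pass might look roughly like this; everything here (the bucket name, the List API, and the retry fields) is an assumption for illustration rather than step-ca's actual interfaces:

```go
// rescheduleRetries is a hypothetical startup pass: with only a k/v
// store, we walk every challenge and resume the retries this instance
// owns, mirroring the query described above.
func (a *Authority) rescheduleRetries() error {
	entries, err := a.db.List(challengeBucket) // assumed k/v API
	if err != nil {
		return err
	}
	for _, e := range entries {
		ch, err := unmarshalChallenge(e.Value)
		if err != nil {
			continue // corrupt entry; a real pass should log this
		}
		r := ch.getRetry()
		if r.Owner == a.ordinal && // owner = my_ordinal
			ch.getStatus() == StatusProcessing && // status == processing
			r.NumAttempts < maxAttempts { // num_retries < max_retries
			go a.RetryChallenge(ch.getID())
		}
	}
	return nil
}
```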

dcow added 5 commits May 7, 2020 09:27

  • The comment in acme/authority directs users to this file, so put a TODO in for posterity.
  • It might make sense to check in the vscode workspace file if we can make everything relative to the project directory.
  • Stop execution when the error happens. This was previously a typo.
  • Prior to validation, we must wrap the base challenge in the correct concrete challenge type so that we dispatch the correct validation method.
dcow added 4 commits May 13, 2020 05:04

  • The linter likes comments on public functions to start with their name, for some reason...
  • Include the "Link" and "Location" headers on invalid challenge resources. An invalid challenge is still a perfectly acceptable response.
  • The error message was updated. Make the test reflect the new changes.
  • Also, return early from ValidateChallenge if the challenge is already valid. Interestingly, we aren't actually testing most of the ValidateChallenge func, just the early error and return conditions. We should add some more coverage here.
@dopey requested a review from maraino May 13, 2020 17:02
dcow added 2 commits May 13, 2020 10:56

  • Set the "Link" and "Location" headers for all successful requests to the challenge resource.

up := &http01Challenge{hc.baseChallenge.clone()}

url := fmt.Sprintf("http://%s/.well-known/acme-challenge/%s", hc.Value, hc.Token)
maraino (Contributor) commented May 13, 2020

HTTP?

Edit: yes, HTTP is ok here.

Contributor

Are Value and Token URL safe? If not, you should use url.URL.

Contributor Author

This isn't new; I just moved the line. I'm happy to use a URL if you think it'd be best. I'm pretty sure the token is specified to be URL safe, and so would be the hostname. But I don't know if we ever sanitize these things.
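
For what it's worth, if sanitization ever became a concern, building the URL with net/url would sidestep it; a sketch reusing the snippet's names:

```go
// Assumes `import "net/url"`. url.URL escapes the path when the URL
// is serialized, so an odd token or host can't change its structure.
u := url.URL{
	Scheme: "http",
	Host:   hc.Value,
	Path:   "/.well-known/acme-challenge/" + hc.Token,
}
challengeURL := u.String()
```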

e := errors.Errorf("no TXT record found at '%s'", record)
up.Error = DNSErr(e).ToACME()
return up, nil
}
Contributor

It looks to me like this check should go after we do the lookup.

Contributor Author

I believe the lookup is here: https://github.com/smallstep/certificates/pull/242/files/5e5a76c3b53e0edc1c654f8b334e651e9b9e3b21#diff-96e563cc3503e2e7e76fad23ad6b0a09R586 (line 586). The following code is just where we try to match one of the returned records with the token.

Contributor Author

I added this check so we could distinguish between no DNS records and no matched records, the idea being that no records is transient whereas an incorrect record is a failure. We could, however, just consider all mismatches transient, because it's DNS.

Contributor Author

Based on https://letsencrypt.org/docs/challenge-types/#dns-01-challenge:

> You can have multiple TXT records in place for the same name. For instance, this might happen if you are validating a challenge for a wildcard and a non-wildcard certificate at the same time. However, you should make sure to clean up old TXT records, because if the response size gets too big Let's Encrypt will start rejecting it.

I think my earlier suggestion to leave the DNS challenge open:

> We could, however, just consider all mismatches transient, because it's DNS.

might be the best course.
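
A sketch of that reading, where only a match is definitive (checkTXT and its arguments are illustrative, not this PR's code):

```go
// checkTXT treats both "no records" and "no matching record" as
// transient: extra TXT records can coexist and propagation lags,
// so a miss may just mean we looked too early.
func checkTXT(name, expected string) (valid, transient bool) {
	records, err := net.LookupTXT(name)
	if err != nil {
		return false, true // lookup failed: retry later
	}
	for _, r := range records {
		if r == expected {
			return true, false // matched: challenge is valid
		}
	}
	return false, true // mismatch: retry, "because it's dns"
}
```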

dcow added 3 commits May 13, 2020 19:22

  • The authority now receives the ordinal in its constructor rather than a global variable set at package initialization time. The ordinal is passed via the command line option `--ordinal`.
  • Since it will ultimately 500 anyway, just return an error.

// The challenge validation process is specific to the type of challenge (dns-01, http-01, tls-alpn-01).
// But, we still pass generic "options" to the polymorphic validate call.
func (a *Authority) validate(ch challenge, jwk *jose.JSONWebKey) (challenge, error) {
Contributor

I'd like the method name here to be validateChallenge or _validateChallenge. Otherwise it seems like it's validation of the authority itself. I appreciate that you can tell from the args, but I was still confused.

if err != nil {
return nil, Wrap(err, "error attempting challenge validation")
return
Contributor

Let's at least log the error here; otherwise we're failing silently, right?

return
}
switch ch.getStatus() {
case StatusPending:
Contributor

Let's add the challenge id (and any other info you think might help debugging; you may even consider printing the entire challenge object in the case of an error) to every trace/log output from this method. Debugging is going to be a pain anyway, given that our db is nosql, but it will be harder still without a challenge id.

retry := ch.getRetry()
switch {
case retry.Owner != a.ordinal:
return
Contributor

We should log here, "Retry of ACME Challenge %s has changed ownership to ordinal %d" or something. Let's make it easy to understand what's happened.
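
Something like this, perhaps (the logger and accessors are assumptions):

```go
// Illustrative log line for the ownership check above.
log.Printf("retry of ACME challenge %s is owned by ordinal %d, not %d; skipping",
	ch.getID(), retry.Owner, a.ordinal)
```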

// RetryChallenge updates the challenge record in the db if the challenge has remaining retry attempts.
//
// see: ValidateChallenge
func (a *Authority) RetryChallenge(chID string) {
Contributor

More logging:

  1. Anywhere we return due to error, we should log the error and, at the very least, the challenge id.

  2. Anywhere we return for logical reasons, consider logging as well (e.g. the ordinal has changed).

switch {
case err != nil:
return
case t.Before(now):
Contributor

Can we leave a comment about this check? Just reading the code, I'm not sure why this matters. We only want to continue with a retry attempt if the retryTime is in the future?

Contributor Author

Looks like I need to flip this. We want to exit if the next attempt is in the future, not the past.
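
With the flip, the guard from the snippet above would read (a sketch using the same names):

```go
switch {
case err != nil:
	return
case t.After(now):
	return // the next attempt is still in the future; too early to retry
}
```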

}
ch = up

p, err := a.LoadProvisionerByID(retry.ProvisionerID)
Contributor

We shouldn't ignore these errors. We should log and try to exit ceremoniously.

Contributor Author

Is it possible for someone to remove a provisioner? If so, we may just want a warning but not try to exit the entire process.

ch = up

p, err := a.LoadProvisionerByID(retry.ProvisionerID)
acc, err := a.GetAccount(p, ch.getAccountID())
Contributor

Same as above.


if err := upd.save(db, dc); err != nil {
return nil, err
// KeyAuthorization creates the ACME key authorization value from a token
Contributor

This method is only used in this file. It probably doesn't need to be exported.

Contributor Author

I didn't add this; I just moved it down to the bottom. I can make it private if it's not intended to be called by others.

dcow added 7 commits May 18, 2020 04:06

  • Add tests for the starting challenge statuses.
  • Removed an unneeded db write test.
  • This logic was already in the correct form, so it was much easier to update.
  • Add `golangci-lint` to the modules so it's available when running `make lint`.
codecov-commenter commented May 21, 2020

Codecov Report

Merging #242 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #242   +/-   ##
=======================================
  Coverage   73.84%   73.84%           
=======================================
  Files          76       76           
  Lines        8921     8921           
=======================================
  Hits         6588     6588           
  Misses       2004     2004           
  Partials      329      329           

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 112fc59...112fc59.

Successfully merging this pull request may close these issues:

  • Implement RFC8555 (ACME spec) § 8.2

6 participants