
api: retry the getDirectory request on DNS errors #1280

Open
wants to merge 1 commit into master

Conversation


@andir andir commented Nov 3, 2020

In some situations, even though you have proper start-up ordering, DNS might briefly be unavailable when the lego units are started.
This is especially critical as systemd doesn't enforce ordering after targets (e.g. the nss-lookup.target) when configuration is being changed and both the local DNS resolver & lego are being restarted at the same time.

@ldez ldez self-requested a review November 3, 2020 21:34

ldez commented Nov 3, 2020

Hello,

The getDirectory request is not related to a DNS server but to the ACME server, so I don't understand your PR.

It seems to me that you're just trying to put a delay before launching lego.


andir commented Nov 3, 2020

Hello,

The getDirectory request is not related to a DNS server but to the ACME server, so I don't understand your PR.

It seems to me that you're just trying to put a delay before launching lego.

This is about a local DNS resolver on the machine that is queried as part of the request to caDirURL. This is required because the application has to resolve the DNS name of the Let's Encrypt API endpoint to one (or a list of) IP addresses.

I am explicitly not trying to add a fixed delay before the startup of lego. What I am trying to cope with here is a transient error from the local DNS server. In this case a local recursor might not yet be able to answer the query (in time), as it might still be busy binding to sockets, resolving the root zone, …. The query then fails (in one way or another).

During system startup the systemd unit ordering directives (Before, After, WantedBy, …) are sufficient to have services come up in the right order. While the system is running, however, there is apparently no way to enforce a defined order for unit reloads/restarts, and thus the local recursive DNS server might be reloaded at the very moment you are trying to deploy new certificates. This happens when you deploy a new set of configuration files and the systemd units that request new certificates are a prerequisite to reloading the webserver.

Gracefully handling this transient error as proposed in this PR looks like the most reasonable way to mitigate it at this point in time.

We've been running into this issue on NixOS whenever we reload multiple system services at the same time. lego would fail with a message such as the following:

Nov 03 21:14:51 bertha systemd[1]: Starting Renew ACME certificate for host.tld...
Nov 03 21:14:52 bertha acme-host.tld-start[11959]: 2020/11/03 21:14:52 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: no such host
Nov 03 21:14:52 bertha systemd[1]: acme-epsilon.host.tld.service: Main process exited, code=exited, status=1/FAILURE
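
To make the intent concrete, here is a minimal sketch of the kind of retry this is about (the helper name and backoff parameters are made up for illustration, this is not the actual diff): retry the directory fetch with exponential backoff, but only while the failure is a DNS lookup error.

```go
package main

import (
	"errors"
	"net"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// getDirectoryWithRetry is a hypothetical wrapper: it retries fn with
// exponential backoff, but only while the failure is a DNS lookup error
// (e.g. the local resolver is still starting up or being restarted).
func getDirectoryWithRetry(fn func() error) error {
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 500 * time.Millisecond
	bo.MaxElapsedTime = 30 * time.Second

	operation := func() error {
		err := fn()
		if err == nil {
			return nil
		}
		var dnsErr *net.DNSError
		if errors.As(err, &dnsErr) {
			return err // transient DNS failure: let backoff retry
		}
		return backoff.Permanent(err) // anything else: give up immediately
	}

	return backoff.Retry(operation, bo)
}
```

Any real change would of course plug into the existing api.Core plumbing rather than a free-standing helper.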


ldez commented Nov 3, 2020

Wouldn't it be more interesting if you added this validation before launching lego?

The api package is the core ACME API; I am very reluctant to add this kind of validation within this package.
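
For illustration only, such a pre-flight check could live in whatever launches lego, along these lines (everything here is hypothetical, not lego code):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// waitForDNS polls the system resolver until the given host resolves
// or the context expires. It is a sketch of a pre-flight check, not
// something lego itself provides.
func waitForDNS(ctx context.Context, host string) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		if _, err := net.DefaultResolver.LookupHost(ctx, host); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := waitForDNS(ctx, "acme-v02.api.letsencrypt.org"); err != nil {
		log.Fatalf("DNS not ready: %v", err)
	}
	// ...then run the actual lego invocation.
}
```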


andir commented Nov 3, 2020

That would probably be one potential workaround for the issue at hand. My reasoning for having it in here is that software should be resilient against flaky network connections in general.

Grepping through the code and seeing the example below made me believe that more graceful error handling would fit, as it has been done previously already:

lego/acme/api/api.go

Lines 80 to 117 in 77f171f

func (a *Core) retrievablePost(uri string, content []byte, response interface{}) (*http.Response, error) {
	// during tests, allow to support ~90% of bad nonce with a minimum of attempts.
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 200 * time.Millisecond
	bo.MaxInterval = 5 * time.Second
	bo.MaxElapsedTime = 20 * time.Second

	ctx, cancel := context.WithCancel(context.Background())

	var resp *http.Response
	operation := func() error {
		var err error
		resp, err = a.signedPost(uri, content, response)
		if err != nil {
			// Retry if the nonce was invalidated
			var e *acme.NonceError
			if errors.As(err, &e) {
				return err
			}

			cancel()
			return err
		}

		return nil
	}

	notify := func(err error, duration time.Duration) {
		log.Infof("retry due to: %v", err)
	}

	err := backoff.RetryNotify(operation, backoff.WithContext(bo, ctx), notify)
	if err != nil {
		return resp, err
	}

	return resp, nil
}

A check before executing the application would help with issues present during startup, but having the proposed changes in the code would make the entire codebase more robust, regardless of any previously run one-off checks.


ldez commented Nov 3, 2020

retrievablePost is related to nonce errors: these kinds of retries are expected by the ACME server (in Pebble there is an option specifically to validate this behavior), so it's not comparable to your PR.

You added this retry only because it is currently the first call that is made, but if I change the first call this modification will no longer work, so it doesn't give me a feeling of improved stability.

I understand your issue, but I still think that your proposal is not the right way to do that.


andir commented Nov 3, 2020

Ok, I guess applying retry logic to all requests could work then? Where would be the best place in the code to handle this in your opinion?


ldez commented Nov 3, 2020

Ok, I guess applying retry logic to all requests could work then?

I didn't mean to imply that at all.

lego is mainly used as a lib, and it's also a CLI tool; I don't think that the lego lib has to handle this kind of issue.

I think that lego doesn't have to handle that, or maybe only in the CLI context.


flokli commented Nov 4, 2020

Hm, having some sort of retry behaviour in libraries that deal with network access doesn't seem to be too controversial - see https://godoc.org/cloud.google.com/go/storage:

All of the methods of this package use exponential backoff to retry calls that fail with certain errors, as described in https://cloud.google.com/storage/docs/exponential-backoff. Retrying continues indefinitely unless the controlling context is canceled or the client is closed. See context.WithTimeout and context.WithCancel.

I'm not entirely sold on the "we'll retry indefinitely unless the controlling context is cancelled or has a timeout" part - some people might not be fully aware of contexts and rely on things eventually erroring out - but at least the "we abstract some network flakiness away from users of a library, so consumers don't all need to implement this themselves" part doesn't sound too bad to me to add to lego the library.
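
As a rough illustration of that context-bounded pattern (generic Go using the cenkalti/backoff package, not an actual lego API):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	// The caller decides how long retrying may go on; without this
	// timeout the loop below would keep retrying until it succeeds
	// (or until the backoff policy itself gives up).
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	attempt := 0
	operation := func() error {
		attempt++
		return fmt.Errorf("attempt %d: pretend the resolver is still down", attempt)
	}

	bo := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)
	if err := backoff.Retry(operation, bo); err != nil {
		fmt.Println("gave up:", err) // reached once the context deadline passes
	}
}
```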


ldez commented Nov 4, 2020

I use the word "context" to talk about a scope/context, not a Go Context; I don't want to add a timeout or deadline.


flokli commented Nov 4, 2020

Yes, I understood.

I don't want to add a timeout or deadline.

Just to make sure I correctly understand that part - so you would prefer an indefinite exponential backoff in the library part over some hardcoded/configurable timeouts?

Or do you still not want any backoff and retry logic in the library at all, meaning every consumer of the library (including the lego CLI) should write that logic themselves?


ldez commented Nov 4, 2020

Just to make sure I correctly understand that part - so you would prefer an indefinite exponential backoff in the library part over some hardcoded/configurable timeouts?

You're simplifying what I'm saying; this is not what I mean.

Or you still don't want any backoff and retry logic in the library, and every consumer of the library(including the lego CLI) should write that logic by themselves?

I'm talking in the scope of this PR: knowing whether DNS is ready before making queries.
Based on that, a discussion about a global retry system in the core API is a bit off-topic.
