Is it possible to exit after a single successful snapshot? #725
Comments
Thanks for raising the issue @jeremych1000.
To address your other comments:
Regarding "I don't understand the reasoning behind deploying […]": to achieve what you need, might I suggest the following. If manual triggering of a full snapshot is all you want, you could simply run etcd-backup-restore in server mode and trigger a full snapshot through its HTTP API. The Helm charts of etcd-backup-restore that used to be maintained previously could also serve as a reference.
Closing this issue since the author has not replied.
Picking this up again, I'm now trying to configure it in server mode as a separate deployment. Running into two issues.
I've tried looking into where this is - https://github.com/gardener/etcd-backup-restore/blob/master/pkg/miscellaneous/miscellaneous.go#L338C5-L338C139 - but I don't see any docs about requiring an ETCD_CONF environment variable or how to mount it?
I can't seem to reopen the issue.
I additionally see https://github.com/gardener/etcd-backup-restore/blob/master/example/01-etcd-config.yaml, but is this required to run in server mode? It implies it's only for testing purposes.
You'll see that the comment on line 338 of etcd-backup-restore/pkg/miscellaneous/miscellaneous.go (lines 335 to 339 at a7fc188) mentions this requirement.
Thus, to run etcd-backup-restore in a standalone fashion (a Deployment in your case), you need to specify an environment variable ETCD_CONF which points to a YAML configuration file that provides etcd-backup-restore with information about the etcd cluster it is backing up.
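As an illustration only (not from the repository's docs), a standalone Deployment might mount such a YAML file from a ConfigMap and point ETCD_CONF at it; the ConfigMap name, mount path, image reference and store values below are assumptions, while the flags themselves appear later in this thread:

```yaml
# Hypothetical pod-spec excerpt for a standalone etcd-backup-restore Deployment.
# Image reference, ConfigMap name, mount path and store values are placeholders.
containers:
  - name: backup-restore
    image: example.registry/etcdbrctl:v0.28.0   # placeholder image reference
    command:
      - etcdbrctl
      - server
      - --storage-provider=S3
      - --store-container=my-backup-bucket
      - --store-prefix=etcd-backups
    env:
      - name: ETCD_CONF
        value: /etc/etcd-backup/etcd.conf.yaml   # the YAML config described above
    volumeMounts:
      - name: etcd-config
        mountPath: /etc/etcd-backup
volumes:
  - name: etcd-config
    configMap:
      name: etcd-backup-config
```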
I can understand why there was confusion, since the necessity for the ETCD_CONF environment variable isn't called out anywhere. The documentation around running in this standalone fashion is lacking. The comment in https://github.com/gardener/etcd-backup-restore/blob/master/example/01-etcd-config.yaml should be changed to signify that the file as-is can be used for testing, and that the template it provides must be followed when running against your own cluster. I'm not able to replicate the errors you're seeing with the […]. The lack of documentation in this case might be due to the fact that etcd-backup-restore is normally deployed by etcd-druid as a sidecar to etcd rather than standalone.
Thanks. I presume this is used to configure etcd-backup-restore with the details of the etcd cluster it should connect to?
Yeah, you'd need the actual values that correspond to the etcd cluster that etcd-backup-restore is acting on. Since your etcd cluster is provisioned by someone else, you should contact them for information about the etcd cluster. It would be fairly easy to fetch information about the etcd cluster through etcdctl.
Thanks - I have access to the etcd pods, so I will look through the pod definitions and work backwards from there. Are all the values used, or is there a minimum subset of required values? Is there any documentation on which lines are used for backup, and which for restore purposes?
If you can exec into a pod, it'll be quite easy to fetch info through etcdctl. It definitely will be useful for consumers of etcd-backup-restore without using etcd-druid. If you have any observations about all of this, you're more than welcome to raise a PR to add documentation for this. Also, I'd say try it out with a single-member etcd first to make it easier for yourself, instead of having to deal with the complexities of a multi-member etcd cluster.
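For instance (a hedged sketch rather than anything from this thread; the pod name, namespace, endpoints and certificate paths are placeholders), exec'ing into one of the etcd pods and running etcdctl will show the member names and client/peer URLs needed for the config:

```sh
# Placeholders: pod name, namespace, endpoints and certificate paths depend on your cluster.
kubectl -n kube-system exec etcd-main-0 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/etcd/ssl/ca.crt --cert=/var/etcd/ssl/client.crt --key=/var/etcd/ssl/client.key \
  member list -w table

kubectl -n kube-system exec etcd-main-0 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/etcd/ssl/ca.crt --cert=/var/etcd/ssl/client.crt --key=/var/etcd/ssl/client.key \
  endpoint status -w table
```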
Thanks - I've got it launching at least with no crashloops (had to add the configmap, plus […]). In the logs of the pod I can see entries like […] about starting leader elections.
I thought it's only supposed to connect to the existing etcd in the cluster like the snapshot command does? Why is it attempting to start leader elections?
etcd-backup-restore is designed to run as a sidecar container to an etcd container in a single pod. In general, etcd is deployed in HA by running 3 (or more) members. This implies that there will be 3 instances of etcd-backup-restore that run as sidecars. The attempt to elect a leader should not stop etcd-backup-restore from backing up snapshots if you have only one instance of it running, as I assume you do.
Thanks. I'm getting close! I have 3 etcd members running currently, deployed by the cluster operators. Currently […].
For the server function I have now gotten it into a state where it's up and stable, and it responds to requests (I can't see a list of URL endpoints anywhere - do you have any? It says 404 not found for most commands). If I […].
How would I tell etcd-backup-restore that it's the only replica running and that it should be the leader?
Adding to what @renormalize has said:
Assuming that you're running the deployment with 1 replica, the etcd config should be set along the lines of the example config used for testing (etcd-backup-restore/example/01-etcd-config.yaml, lines 4 to 17 at a7fc188).
This was recently added to the repo in this PR to make it easy to test […].
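To make the shape of that template concrete, here is a hedged sketch of what an ETCD_CONF file for a single-member etcd might contain; every name, URL and port below is an assumption about your cluster, not a value from the example file:

```yaml
# Hypothetical etcd.conf.yaml for a single-member etcd; adjust names and URLs to your cluster.
name: etcd-main-0
initial-cluster: etcd-main-0=http://etcd-main-0.etcd-main-peer.default.svc:2380
initial-advertise-peer-urls: http://etcd-main-0.etcd-main-peer.default.svc:2380
advertise-client-urls: http://etcd-main-0.etcd-main-peer.default.svc:2379
listen-client-urls: http://0.0.0.0:2379
listen-peer-urls: http://0.0.0.0:2380
initial-cluster-token: initial
initial-cluster-state: new
```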
I'm running HA etcd with 3 replicas. I've tried like 10 different combinations of ports and service names, so close yet so far. I also tried […]. With most configs (as with the one above) it errors with the below: […]. Do I have to hardcode the actual etcd pod names in the initial-cluster field?
All endpoints exposed by etcd-backup-restore are registered in etcd-backup-restore/pkg/server/httpAPI.go (lines 132 to 139 at a7fc188).
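As a hedged illustration of how those endpoints can be hit from inside the cluster (the paths should be verified against httpAPI.go; the service name and namespace are placeholders, with 8080 being the default port mentioned later in this thread):

```sh
# Placeholders: service name and namespace depend on your deployment; 8080 is the default port.
curl -X POST http://etcd-backup-restore.default.svc:8080/snapshot/full   # trigger an on-demand full snapshot
curl http://etcd-backup-restore.default.svc:8080/snapshot/latest         # list the latest snapshots
curl http://etcd-backup-restore.default.svc:8080/healthz                 # health check
```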
Seems like we've hit a wall here. There is no way we can tell etcd-backup-restore that it's the only replica running and that it should be the leader: etcd-backup-restore couples itself tightly to the etcd running alongside it. You can see this in the endpoint list (the […]).
Now, to make your single etcd-backup-restore replica the leader, you must somehow provide the client endpoint of the leading etcd member of the HA etcd cluster, so that this singular etcd-backup-restore member considers itself the leader. How could that be done? I'm at a loss. The fact that etcd-backup-restore by design runs as a single member with a single-member etcd, or as a 3-member cluster with a 3-member etcd, is what is causing this issue. Each etcd-backup-restore replica relies on its accompanying etcd for the privilege to take snapshots.
Update - I got it working if I force the server pod onto the same node as the currently elected etcd leader. If it was on any other control plane node it didn't work.
A follow-up question then - the POST request to take a snapshot worked, and I got a JSON payload back. However, the server itself was still in a loop of taking delta and full snapshots even though I didn't define a schedule. How can I disable this? I was under the impression that using the server subcommand would stop any scheduled backups.
The schedule and the delta snapshot period have default values, so snapshots keep being taken even if you don't set them explicitly. As explained in the documentation pointed to by the above linked comment:
If you don't want delta snapshots, just set the delta-snapshot-period to a value less than 1. You can't really disable full snapshots. Set the schedule to a really long period so it doesn't bother you?
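Concretely, a sketch of the flags (mirroring the server invocation shown later in this thread; the yearly cron expression is an arbitrary illustration, not a recommendation):

```sh
# A delta-snapshot-period below 1 disables delta snapshots; the yearly cron expression
# simply pushes full snapshots far apart, since they cannot be disabled entirely.
etcdbrctl server \
  --storage-provider=<PROVIDER> \
  --store-container=<CONTAINER> \
  --store-prefix=<YOUR_PREFIX> \
  --schedule="0 0 1 1 *" \
  --delta-snapshot-period=0
```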
Anything else, @jeremych1000?
Thank you very much for the responsiveness! One final thing. Can I confirm there needs to be a 1-to-1 mapping (i.e. I can't use initial-cluster to define the IPs of all the etcd replicas), so if I have 3 etcd members I will need to run 3 copies of backup-restore, each colocated on the same node as an etcd replica? As long as at least 1 replica succeeds in taking backups I'm happy!
You're right that there needs to be a 1-1 mapping between an etcd member and an etcd-backup-restore pod. You run 3 replicas of etcd-backup-restore, where each is colocated on the same node as a replica of the etcd cluster, and the endpoint that is passed to etcd-backup-restore is the endpoint of the etcd member it is colocated with. This is exactly what etcd-backup-restore does while running as a sidecar, albeit more simply, since the endpoint for the colocated etcd member is just localhost. Once you maintain these three replicas with a 1:1 mapping, there will always be one etcd-backup-restore that takes snapshots. Glad we've figured a solution out for you!
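One hedged way to express that colocation in Kubernetes (a sketch only; the labels, member name and config path are assumptions about your cluster) is pod affinity on the matching etcd member, repeated once per member:

```yaml
# Hypothetical pod-template fragment for one of the three etcd-backup-restore Deployments.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: etcd
              member: etcd-main-0          # pin this replica to the node running member 0
  containers:
    - name: backup-restore
      env:
        - name: ETCD_CONF
          value: /etc/etcd-backup/etcd-main-0.conf.yaml   # config whose endpoint is member 0's client URL
```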
I'd appreciate it if you could draft a PR enhancing the docs if the way you're using etcd-backup-restore works as expected for you, since such docs are currently missing. Documentation on using etcd-backup-restore standalone, as a consumer with an already existing etcd cluster that the operator cannot touch, is unfortunately lacking. This would help make etcd-backup-restore more approachable as an option!
Thanks, will do. In terms of the schedule flag, […].
This is my helm template: […]
values.yaml: […]
I'm running […].
If I use […].
I'm not able to figure out what the reason could be from the logs you've shared. When I run

```sh
./bin/etcdbrctl server --storage-provider=<PROVIDER> --store-container=<CONTAINER> --store-prefix=<YOUR_PREFIX> --schedule="0 0 * * *" --delta-snapshot-period=0
```

it behaves as expected for me. The only difference I see between your Helm chart and the chart that used to be maintained for etcd-backup-restore previously is that yours has the equals sign missing. Maybe give that a shot? (See etcd-backup-restore/chart/etcd-backup-restore/templates/etcd-statefulset.yaml, lines 96 to 100 at a7fc188.)
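For reference, a hedged sketch of container args in a chart template using the equals-sign form (the flag names appear in this thread; the .Values keys are placeholders, not the old chart's actual values layout):

```yaml
# Hypothetical container command in a Helm template; the .Values keys are placeholders.
command:
  - etcdbrctl
  - server
  - --schedule={{ .Values.backup.schedule }}
  - --delta-snapshot-period={{ .Values.backup.deltaSnapshotPeriod }}
  - --storage-provider={{ .Values.backup.storageProvider }}
  - --store-container={{ .Values.backup.storageContainer }}
  - --store-prefix={{ .Values.backup.backupPrefix }}
```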
Thanks, I could've sworn I tried that before, but it now works - I also removed […]. I've got the end-to-end flow working as well. A bit janky maybe, but it works. What I've done: […]
I couldn't find a way to make the etcd-backup-restore pods aware of each other. Happy to improve the docs, but my use case seems a bit niche / hacky. Which bits are you interested in for me to document further? Perhaps the requirements for the 1-1 mapping?
Thanks for summarizing your setup for using etcd-backup-restore for your use case; it gives us insight into what people interested in etcd-backup-restore would like to use it for. There are parts which could be generalized, but the maintainers will probably take that up some other time instead of dedicating time and effort to it right now. I'm sure there's a way to make all the etcd-backup-restore pods aware of each other. I'll look into it when I get time.
@jeremych1000 the maintainers would like to discuss the pluggability of etcd-backup-restore with you, and how to enhance etcd-backup-restore in the future. Would you be okay with a call?
Hello, that would be useful. I'm in the UK, happy to discuss agenda and meeting times through email.
@renormalize quick question - with the […] I can't specify two buckets, nor can I specify what port the server spins up on (default 8080). With […].
Run the […].
Backing up to two buckets simultaneously is not supported.
@jeremych1000 let's discuss your requirements. We are overhauling […].
@renormalize can you please schedule a meeting?
I may be using this completely wrong, but I want two things.
For the long running deployment I'm using a Kubernetes deployment and it works well.
For the manual trigger, I'm using a Kubernetes cronjob to enable manually triggering a full snapshot, but I can't get etcd-backups to exit!
I want a one-time full snapshot, and etcd-backups to exit 0.
I've tried not putting in a schedule, and setting --delta-snapshot-period=0, but this doesn't do anything - it still runs forever with some sort of default schedule.
I'm using v0.28.0 and here is my config: […]
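Given how the thread resolves (an on-demand full snapshot is triggered over the long-running server's HTTP API rather than by a second etcdbrctl process that must exit), the manual-trigger CronJob could be reduced to a curl against that server. The manifest below is a sketch with placeholder names, not a configuration taken from this issue:

```yaml
# Hypothetical CronJob that POSTs to the long-running etcd-backup-restore server for an
# on-demand full snapshot and then exits; service name, namespace and port are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-manual-full-snapshot
spec:
  suspend: true                  # keep suspended; trigger manually with `kubectl create job --from=cronjob/etcd-manual-full-snapshot <job-name>`
  schedule: "0 3 * * 0"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.8.0   # placeholder image
              command: ["curl", "-fsS", "-X", "POST", "http://etcd-backup-restore.default.svc:8080/snapshot/full"]
```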