Is it possible to exit after a single successful snapshot? #725

Open · jeremych1000 opened this issue Apr 5, 2024 · 33 comments

Labels: status/accepted Issue was accepted as something we need to work on

@jeremych1000 commented Apr 5, 2024

I may be using this completely wrong, but I want two things.

  • a long running deployment that does the full/delta snapshots
  • a way of manually triggering a full snapshot

For the long-running deployment I'm using a Kubernetes Deployment, and it works well.

For the manual trigger, I'm using a Kubernetes CronJob to manually trigger a full snapshot, but I can't get etcd-backups to exit!

I want a one-time full snapshot, after which etcd-backups exits 0.

I've tried omitting the schedule and setting --delta-snapshot-period=0, but neither does anything - it still runs forever on some sort of default schedule.

I'm using v0.28.0, and here is my config:

containers:
- name: {{ .Chart.Name }}
  image: {{ .Values.image.repo }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
  args:
  - snapshot
  - --endpoints=https://127.0.0.1:2379
  - --cacert=/etc/kubernetes/pki/etcd/ca.crt
  - --cert=/etc/kubernetes/pki/etcd/server.crt
  - --key=/etc/kubernetes/pki/etcd/server.key
  - --compress-snapshots
  - --garbage-collection-policy={{ .Values.backup.strategy }}
  # disable delta snapshotting in the cronjob 
  # https://github.com/gardener/etcd-backup-restore/blob/rel-v0.27/chart/etcd-backup-restore/values.yaml#L47
  - --delta-snapshot-period=0 
  - --max-backups={{ .Values.backup.maxBackups }}
  - --storage-provider=S3
  - --store-container={{ .Values.s3.bucket }}
  - --store-prefix={{ .Values.s3.prefix }}
renormalize self-assigned this Apr 8, 2024
@renormalize (Member) commented Apr 8, 2024

Thanks for raising the issue @jeremych1000.

etcd-backup-restore is not designed to exit after a single successful full/delta snapshot, which is what it would need to do to work when run in a Kubernetes CronJob.

To address your other comments:

  • Not putting in a schedule:
    etcd-backup-restore assumes a default schedule if one is not explicitly provided; the default delta-snapshot period is 20s, and the default full-snapshot period is 1 hour.
  • Setting --delta-snapshot-period=0:
    Setting the delta-snapshot period to 0 turns off delta snapshots. Full snapshots are still scheduled to be taken, which is why the process does not exit with code 0.

I don't understand the reasoning behind deploying etcd-backup-restore in a CronJob if the need is a manual trigger of a snapshot - why deploy a CronJob for this instead of a Job? To trigger snapshots on a cron schedule, just use the cron-schedule capabilities of etcd-backup-restore.


To achieve what you need, might I suggest the server subcommand instead of the snapshot subcommand.

  • The server command starts an HTTP server, by default on port 8080, with various paths available at that port. This server performs the same actions you expect from the snapshot command, along with additional features. Docs for the server command can be found here.
    Running the server provides the capability to perform out-of-schedule snapshots.
    To trigger a full snapshot out of schedule, send a POST request to the /snapshot/full path at the exposed port of the server. The same applies for delta snapshots.

  • The server command also initializes the etcd database before the snapshotting flow begins, i.e. it checks whether the database is valid or corrupt; if it is corrupt, it restores from the remote object store. Docs for the initialization command can be found here.


If manual triggering of a full snapshot is all you want, you could simply send a POST request with curl to the endpoint mentioned above.
You could also deploy a Job that does nothing but send that curl request and wait for a successful response before exiting, which gives you the exit behavior you expect from a Job.
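
A minimal sketch of such a trigger, assuming the server is reachable at a hypothetical service name etcd-backup-restore on the default port 8080:

# Hypothetical address; adjust to wherever the backup-restore server is exposed.
# curl returns once the snapshot request completes, so a Job wrapping this
# command exits 0 on success.
curl -X POST http://etcd-backup-restore:8080/snapshot/full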


The Helm charts of etcd-backup-restore are not fully maintained, so not every feature may be available to you through them; the features mentioned above are, however. To make use of all capabilities, I suggest you also take a look at the operator for etcd clusters, gardener/etcd-druid. It deploys etcd-backup-restore as a sidecar container to the etcd cluster nodes and helps with provisioning, maintenance, and deletion.

@renormalize (Member)

Closing this issue since the author has not replied.
Reopen if necessary.

gardener-robot added the status/closed label Apr 10, 2024
@jeremych1000 (Author)

Picking this up again, I'm now trying to configure it in server mode as a separate deployment. Running into two issues.

  1. I can't seem to define the schedule. I've tried the following. It either complains that --schedule isn't a valid flag, or it fails to convert a string to an int. The default schedule happens to suit me, so I could always omit it, but I'm not sure what I'm doing wrong?
--schedule "0 */1 * * *"
--schedule="0 */1 * * *"
--schedule 0 */1 * * *
--schedule=0 */1 * * *
  2. In my server deployment configuration I have the same command-line parameters as the snapshot one, but it fails to start with
time="2024-04-26T16:38:10Z" level=fatal msg="failed to read etcd config file: unable to read etcd config file at path: /var/etcd/config/etcd.conf.yaml : open /var/etcd/config/etcd.conf.yaml: no such file or directory" actor=backup-r
estore-server configFile=/var/etcd/config/etcd.conf.yaml 

I've tried looking into where this comes from - https://github.com/gardener/etcd-backup-restore/blob/master/pkg/miscellaneous/miscellaneous.go#L338C5-L338C139 - but I don't see any docs about requiring an ETCD_CONF env var or mounting this file?

@jeremych1000 (Author)

I can't seem to reopen the issue.

@jeremych1000 (Author)

I additionally see https://github.com/gardener/etcd-backup-restore/blob/master/example/01-etcd-config.yaml - but is this required to run in server mode? It implies it's only for testing purposes?

renormalize reopened this Apr 27, 2024
gardener-robot added the status/accepted label and removed the status/closed label Apr 27, 2024
@renormalize (Member)

@jeremych1000

You'll see that the comment on line 338 mentions
For tests or to run backup-restore server as standalone, user needs to set ETCD_CONF variable with proper location of ETCD config yaml.

// GetConfigFilePath returns the path of the etcd configuration file
func GetConfigFilePath() string {
	// (For testing purpose) If no ETCD_CONF variable set as environment variable, then consider backup-restore server is not used for tests.
	// For tests or to run backup-restore server as standalone, user needs to set ETCD_CONF variable with proper location of ETCD config yaml
	etcdConfigForTest := os.Getenv("ETCD_CONF")

Thus, to run etcd-backup-restore in a standalone fashion (a Deployment in your case), you need to set an environment variable ETCD_CONF pointing to a YAML configuration file that gives etcd-backup-restore information about the etcd cluster it is backing up.
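
For example, a minimal sketch of running the server standalone with that variable set (the file path, bucket, and prefix are placeholders):

# ETCD_CONF must point to a valid etcd config YAML describing the target cluster.
export ETCD_CONF=/var/etcd/config/etcd.conf.yaml
etcdbrctl server \
  --storage-provider=S3 \
  --store-container=<bucket> \
  --store-prefix=<prefix>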

I can understand why there was confusion: the necessity of the ETCD_CONF environment variable for standalone configurations is mentioned here in the source as a comment, but not in the documentation; and the contexts in which it is mentioned in the documentation are about testing only, which is unfortunate.

The documentation around ETCD_CONF should be enhanced to make clear that it must be exported, pointing to a valid file, when running the server command.

The comment in https://github.com/gardener/etcd-backup-restore/blob/master/example/01-etcd-config.yaml should be changed to clarify that the file as-is can be used for testing, and that the template it provides must be followed to run etcd-backup-restore as a server.


I'm not able to replicate the errors you're seeing with the --schedule flag. Could you tell me which release/commit you're running?
If the ETCD_CONF environment variable is exported, the --schedule flag works as expected for me on master. Even when it is not, the program flow should stop due to the missing ETCD_CONF.


The lack of documentation in this case might be due to the fact that etcd-backup-restore is typically used as a sidecar container to an etcd, with both containers (an etcd and an etcd-backup-restore) deployed through the Etcd CR as defined by gardener/etcd-druid. It will be enhanced.

@jeremych1000 (Author)

Thanks. I presume this is used to configure etcd-backup-restore - as in, it needs actual values? How would I go about finding those? I'm on a cluster which is provisioned by someone else.

@renormalize (Member)

Yeah, you'd need the actual values that correspond to the etcd cluster that etcd-backup-restore is acting on.
The primary values you need to configure are initial-cluster (the endpoints of the etcd cluster members), TLS for communication, and the data directory (the Persistent Volume where the etcd DB is stored, which is mounted into etcd-backup-restore).

Since your etcd cluster is provisioned by someone else, you should contact them for information about the etcd cluster. It would be fairly easy to fetch information about the etcd cluster through etcdctl.
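
As a sketch, these are the kinds of etcdctl queries that surface that information (the endpoint and certificate paths are assumptions, mirroring your snapshot config):

# List the members of the existing cluster: names, peer URLs, client URLs.
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table

# Show per-endpoint status, including which member is currently the leader.
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w table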

@jeremych1000 (Author)

Yeah, you'd need the actual values that correspond to the etcd cluster that etcd-backup-restore is acting on. The primary values you need to configure are initial-cluster (the endpoints of the etcd cluster members), TLS for communication, and the data directory (the Persistent Volume where the etcd DB is stored, which is mounted into etcd-backup-restore).

Since your etcd cluster is provisioned by someone else, you should contact them for information about the etcd cluster. It would be fairly easy to fetch information about the etcd cluster through etcdctl.

Thanks - I have access to the etcd pods so will look through the pod definitions and work backwards from there.

Are all the values used, or is there a minimum subset of required values? Is there any documentation on which lines are used for backup, and which for restore purposes?

@renormalize (Member)

If you can exec into a pod, it'll be quite easy to fetch info through etcdctl.
There isn't really any documentation on which lines of this config are used for backup and which for restore.

Such documentation would definitely be useful for consumers of etcd-backup-restore that don't use etcd-druid. If you have any observations about all of this, you're more than welcome to raise a PR to add documentation for it.

Also, I'd say try it out with a single-member etcd first to make it easier for yourself, instead of having to deal with the complexities of a multi-member etcd cluster.

@jeremych1000 (Author)

If you can exec into a pod, it'll be quite easy to fetch info through etcdctl. There isn't really any documentation on which lines of this config are used for backup and which for restore.

Such documentation would definitely be useful for consumers of etcd-backup-restore that don't use etcd-druid. If you have any observations about all of this, you're more than welcome to raise a PR to add documentation for it.

Also, I'd say try it out with a single-member etcd first to make it easier for yourself, instead of having to deal with the complexities of a multi-member etcd cluster.

Thanks - I've got it launching at least with no crashloops (I had to add the configmap, plus POD_NAME and POD_NAMESPACE in the deployment spec). However, I'm still confused as to what the server does.

In the logs of the pod I can see stuff like

time="2024-04-29T13:03:53Z" level=info msg="Starting HTTP server at addr: :8080" actor=backup-restore-server                                                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Etcd is now running. Continuing br startup"                                                                                                                                                 
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=error msg="failed to update member peer url: could not fetch member URL : could not parse peer URL from the config file : invalid peer URL : http://127.0.0.1:2380" actor=backup-restore-server       
time="2024-04-29T13:03:53Z" level=info msg="Creating leaderElector..." actor=backup-restore-server                                                                                                                                      
time="2024-04-29T13:03:53Z" level=info msg="Starting leaderElection..." actor=leader-elector                                                                                                                                            

I thought it's only supposed to connect to the existing etcd in the cluster, like the snapshot command does? Why is it attempting to start leader election?

@renormalize (Member)

etcd-backup-restore is designed to run as a sidecar container to an etcd container in a single pod. In general, etcd is deployed in HA, by running 3 (or more) members.

This implies that there will be 3 instances of etcd-backup-restore running as sidecars.
We obviously cannot have 3 separate actors trying to back up a single etcd cluster. This is why there is a leader election amongst the instances of etcd-backup-restore.
The leader will always be the etcd-backup-restore container that is the sidecar to the etcd member that is the leader.

The attempt to elect a leader should not stop etcd-backup-restore from backing up snapshots if you have only one instance of it running, as I assume you do.

@jeremych1000 (Author)

etcd-backup-restore is designed to run as a sidecar container to an etcd container in a single pod. In general, etcd is deployed in HA, by running 3 (or more) members.

This implies that there will be 3 instances of etcd-backup-restore running as sidecars. We obviously cannot have 3 separate actors trying to back up a single etcd cluster. This is why there is a leader election amongst the instances of etcd-backup-restore. The leader will always be the etcd-backup-restore container that is the sidecar to the etcd member that is the leader.

The attempt to elect a leader should not stop etcd-backup-restore from backing up snapshots if you have only one instance of it running, as I assume you do.

Thanks. I'm getting close!

I currently have 3 etcd members running, deployed by the cluster operators. etcd-backup-restore is deployed as its own deployment and pod, not as a sidecar (as I don't control how the cluster etcd gets deployed). This works well for the snapshot function.

For the server function I have now gotten it into a state where it's up and stable, and it responds to requests (I can't see a list of URL endpoints anywhere - do you have one? It says 404 not found for most paths).

If I curl /snapshot/full, it returns

time="2024-04-29T14:00:59Z" level=info msg="Fowarding the request to take out-of-schedule full snapshot to backup-restore leader" actor=backup-restore-server                                                                           
time="2024-04-29T14:00:59Z" level=warning msg="Unable to check backup leader health: Get \"https://172.19.112.7:8080/healthz\": dial tcp 172.19.112.7:8080: connect: connection refused" actor=backup-restore-server 

How would I tell etcd-backup-restore that there is only 1 copy of itself running?

@anveshreddy18 (Contributor)

Adding to what @renormalize has said

How would I tell etcd-backup-restore that there is only 1 copy of itself running?

Assuming that you're running the deployment with 1 replica, the etcd config should be set along the lines of the example config used for testing:

name: etcd
data-dir: "default.etcd"
metrics: extensive
snapshot-count: 75000
enable-v2: false
quota-backend-bytes: 8589934592 # 8Gi
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://0.0.0.0:2379
initial-advertise-peer-urls: http://0.0.0.0:2380
initial-cluster: etcd=http://0.0.0.0:2380
initial-cluster-token: new
initial-cluster-state: new
auto-compaction-mode: periodic
auto-compaction-retention: 30m

This was recently added to the repo in this PR to make it easy to test backup-restore with any running etcd process. Please make sure you have not added any extra configuration beyond the minimum required values mentioned, for initial testing.

@jeremych1000 (Author)

I'm running HA etcd with 3 replicas. I've tried like 10 different combinations of ports and service names - so close, yet so far. I also tried initial-cluster-state: existing, as I don't want etcd-backup-restore to spin up its own etcd.

With most configs (such as the one above) it errors with the below. Do I have to hardcode the actual etcd pod names in the initial-cluster key, maybe?

time="2024-04-29T15:59:52Z" level=info msg="etcd-backup-restore Version: v0.28.0"                                                                                                                                                       
time="2024-04-29T15:59:52Z" level=info msg="Git SHA: 727e957b"                                                                                                                                                                          
time="2024-04-29T15:59:52Z" level=info msg="Go Version: go1.20.3"                                                                                                                                                                       
time="2024-04-29T15:59:52Z" level=info msg="Go OS/Arch: linux/amd64"                                                                                                                                                                    
time="2024-04-29T15:59:52Z" level=info msg="compressionConfig:\n  enabled: true\n  policy: gzip\ndefragmentationSchedule: 0 0 */3 * *\netcdConnectionConfig:\n  caFile: /etc/kubernetes/pki/etcd/ca.crt\n  certFile: /etc/kubernetes/pki
time="2024-04-29T15:59:52Z" level=info msg="Setting status to : 503" actor=backup-restore-server                                                                                                                                        
time="2024-04-29T15:59:52Z" level=info msg="Registering the http request handlers..." actor=backup-restore-server                                                                                                                       
time="2024-04-29T15:59:52Z" level=info msg="Starting the http server..." actor=backup-restore-server                                                                                                                                    
time="2024-04-29T15:59:52Z" level=info msg="Checking if etcd is running"                                                                                                                                                                
time="2024-04-29T15:59:52Z" level=info msg="Starting HTTP server at addr: :8080" actor=backup-restore-server                                                                                                                            
time="2024-04-29T15:59:52Z" level=info msg="Etcd is now running. Continuing br startup" 
time="2024-04-29T15:59:52Z" level=info msg="Updating member peer URL for etcd-backups-server-6d8d7bd5b6-rtkfn" actor=member-add                                                                                            
time="2024-04-29T15:59:52Z" level=error msg="failed to update member peer url: could not fetch member URL : could not parse peer URL from the config file : invalid peer URL : http://0.0.0.0:2380" actor=backup-restore-server 

@renormalize (Member)

All endpoints exposed by etcd-backup-restore:

mux.HandleFunc("/initialization/start", h.serveInitialize)
mux.HandleFunc("/initialization/status", h.serveInitializationStatus)
mux.HandleFunc("/snapshot/full", h.serveFullSnapshotTrigger)
mux.HandleFunc("/snapshot/delta", h.serveDeltaSnapshotTrigger)
mux.HandleFunc("/snapshot/latest", h.serveLatestSnapshotMetadata)
mux.HandleFunc("/config", h.serveConfig)
mux.HandleFunc("/healthz", h.serveHealthz)
mux.Handle("/metrics", promhttp.Handler())

Seems like we've hit a wall here. There is no way we can tell etcd-backup-restore that it's the only replica running and it should be the leader: etcd-backup-restore couples itself tightly to the etcd running alongside.

etcd-backup-restore checks whether the etcd at the endpoint provided to it in the configuration (the --endpoints flag, i.e. the client URL of the etcd running alongside) is the leader; if it is, that instance becomes the leader in the etcd-backup-restore "cluster" and proceeds to perform the functionality you want from etcd-backup-restore.

Now, to make your single etcd-backup-restore replica the leader, you must somehow provide it the client endpoint of the leading etcd member of the HA etcd cluster, so that this singular etcd-backup-restore member considers itself the leader.

How could that be done? I'm at a loss. The fact that etcd-backup-restore by design runs as a single member with a single-member etcd, or as a 3-member cluster with a 3-member etcd, is what is causing this issue.

Each etcd-backup-restore replica relies on its accompanying etcd for the privilege to take snapshots.

@jeremych1000 (Author)

Update - I got it working if I force the server pod onto the same node as the currently elected etcd leader. If it was on any other control plane node, it didn't work.

A follow-up question then - the POST request to take a snapshot worked, and I got a JSON payload back. However, the server itself was still in its loop of taking delta and full snapshots, even though I didn't define a schedule?

How can I disable this? I was under the impression using the server keyword would stop any scheduled backups.

@renormalize (Member) commented Apr 29, 2024

The server is an enhancement of the snapshot command as I've explained in detail in #725 (comment). It is designed to always take full snapshots. It lets an operator perform out-of-schedule snapshots.

As explained in the documentation pointed to by the above linked comment:

Etcdbrctl server
With sub-command server you can start a http server which exposes an endpoint to initialize etcd over REST interface. The server also keeps the backup schedule thread running to keep taking periodic backups. This is mainly made available to manage an etcd instance running in a Kubernetes cluster. You can deploy the example helm chart on a Kubernetes cluster to have a fault-resilient, self-healing etcd cluster.

The server also keeps the backup schedule thread running to keep taking periodic backups.

If you don't want delta snapshots, just set --delta-snapshot-period to less than 1.

You can't really disable full snapshots. Set the schedule to a really long period so it doesn't bother you?
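
For example, a sketch - the yearly cron expression here is just an illustration of a "really long period":

# Delta snapshots off; full snapshots scheduled only once a year (midnight, Jan 1).
etcdbrctl server --delta-snapshot-period=0 --schedule="0 0 1 1 *" \
  --storage-provider=<PROVIDER> --store-container=<CONTAINER> --store-prefix=<YOUR_PREFIX>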

@renormalize (Member)

Anything else @jeremych1000?

@jeremych1000 (Author) commented Apr 29, 2024

Anything else @jeremych1000?

Thank you very much for the responsiveness!

One final thing. Can I confirm there needs to be a 1-to-1 mapping (i.e. I can't use initial-cluster to define the IPs of all the etcd replicas)? So if I have 3 etcd members, I will need to run 3 copies of backup-restore, each of which is colocated on the same node as an etcd replica.

As long as at least 1 replica succeeds in taking backups, I'm happy!

@renormalize (Member)

You're right that there needs to be a 1-1 mapping between an etcd member and an etcd-backup-restore pod.

You run 3 replicas of etcd-backup-restore, each colocated on the same node as a replica of the etcd cluster, and the endpoint passed to each etcd-backup-restore is the endpoint of the etcd member it is colocated with. This is exactly what etcd-backup-restore does while running as a sidecar, albeit more simply, since the endpoint for the colocated etcd member is just localhost.

Once you maintain these three replicas with a 1:1 mapping, there will always be one etcd-backup-restore that takes snapshots.
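
A rough sketch of the per-replica invocation under that 1:1 scheme (the member hostname is a placeholder; the other flags mirror your existing config):

# Run on the node hosting etcd member N; point backup-restore only at that member.
etcdbrctl server \
  --endpoints=https://<colocated-etcd-member>:2379 \
  --storage-provider=S3 \
  --store-container=<bucket> \
  --store-prefix=<prefix>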

Glad we've figured a solution out for you!

@renormalize (Member)

I'd appreciate it if you could draft a PR enhancing the docs, if the way you're using etcd-backup-restore works as expected for you, since they're currently missing.

The docs for using etcd-backup-restore standalone, by a consumer with an already existing etcd cluster that they cannot touch, are unfortunately lacking. This would help make etcd-backup-restore more approachable as an option!

@jeremych1000 (Author)

Thanks, will do.

In terms of the schedule flag,

Error: unknown flag: --schedule "0 */1 * * *"                                                                                                                                                                                           
Usage:                                                                                                                                                                                                                                  
  etcdbrctl server [flags]                                                                                                                                                                                                              
                                                                                                                                                                                                                                        
Flags:                                                                                                                                                                                                                                  
unknown flag: --schedule "0 */1 * * *"     

This is my helm template:

    spec:
      containers:
        - name: {{ .Release.Name }}
          image: {{ .Values.image.repo }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
          args:
            - server
            - <...>
            # TODO - figure out how this works; I've tried adding quotes, using | quote, not using it, etc.
            # it defaults to "0 */1 * * *" which will work for now
            - --schedule {{ .Values.backup.schedule | quote }}

values.yaml

backup:
  strategy: Exponential
  maxBackups: 10 # ONLY USED IF STRATEGY == LIMITBASED
  schedule: "0 */1 * * *" # every hour

@jeremych1000 (Author)

I'm running gardener-project/public/gardener/etcdbrctl:v0.28.0.

@jeremych1000 (Author)

If I use -s {{ .Values.backup.schedule | quote }} instead of --schedule it says

time="2024-04-30T12:43:13Z" level=info msg="etcd-backup-restore Version: v0.28.0"                                                                                                                                                       
time="2024-04-30T12:43:13Z" level=info msg="Git SHA: 727e957b"                                                                                                                                                                          
time="2024-04-30T12:43:13Z" level=info msg="Go Version: go1.20.3"                                                                                                                                                                       
time="2024-04-30T12:43:13Z" level=info msg="Go OS/Arch: linux/amd64"                                                                                                                                                                    
time="2024-04-30T12:43:13Z" level=fatal msg="failed to validate the options: failed to parse int from \"0: strconv.Atoi: parsing \"\\\"0\": invalid syntax" 

@renormalize (Member)

I'm not able to figure out what the reason could be from the logs you've shared. When I run etcdbrctl with a custom schedule (one full snapshot a day) with delta snapshots disabled, I simply run:

./bin/etcdbrctl server --storage-provider=<PROVIDER> --store-container=<CONTAINER> --store-prefix=<YOUR_PREFIX> --schedule="0 0 * * *" --delta-snapshot-period=0

The only difference I see between your Helm chart and the chart that used to be maintained for etcd-backup-restore is that yours is missing the equals sign. Maybe give that a shot?

- name: backup-restore
  command:
  - etcdbrctl
  - server
  - --schedule={{ .Values.backup.schedule }}

@jeremych1000 (Author)

equal-to sign

Thanks, I could've sworn I tried that before, but it now works - I also removed | quote.

I've got the end to end flow working as well. A bit janky maybe, but it works.

What I've done:

  • a daemonset with tolerations for control-plane nodes (this guarantees a pod on any etcd node)
  • the readiness probe is set to port 2379 (so replicas on control-plane nodes that don't run etcd will always be 0/1 ready)
  • the daemonset runs the server command, with a schedule we define that takes backups every hour
  • the daemonset is also fronted by a headless service, i.e. nslookups return the IPs of all active etcd-backup-restore replicas
  • as a fallback, we have a cronjob that runs every day
  • this cronjob runs a python script
  • the script runs an nslookup against the headless service and gets the IPs of ready replicas
  • it then loops through them, sending a POST to /snapshot/full (sketched in shell below)
  • if it 405's, that means that etcd wasn't the master
  • when it hits the master etcd, it 200's and returns a successful snapshot result
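
The loop the script performs, sketched here in shell under assumed names (a headless service etcd-backup-restore-headless resolving to the ready replicas, server port 8080):

#!/usr/bin/env bash
# Resolve the ready replicas behind the (assumed) headless service, then try
# each until the leader accepts the full-snapshot trigger with HTTP 200.
for ip in $(getent hosts etcd-backup-restore-headless | awk '{print $1}'); do
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "http://${ip}:8080/snapshot/full")
  if [ "${code}" = "200" ]; then
    echo "full snapshot triggered via ${ip}"
    exit 0
  fi
done
echo "no replica accepted the snapshot request" >&2
exit 1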

I couldn't find a way to make the etcd-backup-restore replicas aware of / talk to each other (I would've loved to send one POST request to any replica and have it forwarded to the etcd-backup-restore leader). I saw the initial-cluster config in etcd-config.yaml, but I can't hardcode node names in my helm chart as they are not known ahead of time.

Happy to improve the docs, but my use case seems a bit niche/hacky. Which bits would you like me to document further? Perhaps the requirement for the 1-1 mapping?

@renormalize (Member)

Thanks for summarizing your setup for using etcd-backup-restore; it gives us insight into what people interested in etcd-backup-restore would like to use it for.

There are parts that could be generalized, but the maintainers will probably take that up some other time instead of dedicating time and effort to it right now.

I'm sure there's a way to make all the etcd-backup-restore pods aware of each other. I'll look into it when I get time.

@renormalize (Member) commented May 2, 2024

@jeremych1000 the maintainers would like to discuss the pluggability of etcd-backup-restore with you, to enhance it in the future.

Would you be okay with a call?
Our timezone is Indian Standard Time. Looking forward to your reply. If you're up for a call, we'll share a meeting link over email.

@jeremych1000 (Author)

@jeremych1000 the maintainers would like to discuss the pluggability of etcd-backup-restore with you, to enhance it in the future.

Would you be okay with a call? Our timezone is Indian Standard Time. Looking forward to your reply. If you're up for a call, we'll share a meeting link over email.

Hello, that would be useful. I'm in the UK - happy to discuss the agenda and meeting times over email.

@jeremych1000 (Author)

@renormalize quick question - with the server command, how would I configure which port the server listens on? I now want to run two copies of etcd-backup-restore for every etcd member, as we have 2 locations to back up to for redundancy purposes.

I can't specify two buckets, nor can I specify which port the server spins up on (default 8080). With hostNetwork: true this means the second pod errors out:

"Failed to start http server: listen tcp :8080: bind: address already in use"

@renormalize (Member)

Run server with the --help flag:

➜  etcd-backup-restore git:(master) ./bin/etcdbrctl server --help
Server will keep listening for http request to deliver its functionality through http endpoints.

Usage:
  etcdbrctl server [flags]

Flags:
      --auto-compaction-mode string                        mode for auto-compaction: 'periodic' for duration based retention. 'revision' for revision number based retention. (default "periodic")
...
...
  -p, --server-port uint                                   port on which server should listen (default 8080)

Backing up to two buckets simultaneously is not supported.
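
Given that constraint, one instance per bucket with distinct ports is the workaround; a sketch with placeholder values:

# First copy: bucket A on the default port 8080.
etcdbrctl server --storage-provider=S3 --store-container=<bucket-a> --store-prefix=<prefix>

# Second copy: bucket B, moved off 8080 with --server-port to avoid the bind error.
etcdbrctl server --storage-provider=S3 --store-container=<bucket-b> --store-prefix=<prefix> --server-port=8081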

@unmarshall (Contributor)

@jeremych1000 let's discuss your requirements. We are overhauling etcd-backup-restore and would very much like to understand all of your use cases. This will help us in designing the next version. Can you please prepare the following:

  1. Difficulties faced consuming etcd-backup-restore
  2. Features that are currently missing, with a use case for each so that we can better understand them

@renormalize can you please schedule a meeting.
