Is it possible to exit after a single successful snapshot? #725

Open · jeremych1000 opened this issue Apr 5, 2024 · 33 comments

Labels: status/accepted Issue was accepted as something we need to work on

@jeremych1000 commented Apr 5, 2024

I may be using this completely wrong, but I want two things.

  • a long running deployment that does the full/delta snapshots
  • a way of manually triggering a full snapshot

For the long-running deployment I'm using a Kubernetes Deployment, and it works well.

For the manual trigger, I'm using a Kubernetes CronJob to manually trigger a full snapshot, but I can't get etcd-backups to exit!

I want a one-time full snapshot, after which etcd-backups exits 0.

I've tried omitting the schedule and setting --delta-snapshot-period=0, but neither does anything - it still runs forever on some sort of default schedule.

I'm using v0.28.0, and here is my config:

containers:
- name: {{ .Chart.Name }}
  image: {{ .Values.image.repo }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
  args:
  - snapshot
  - --endpoints=https://127.0.0.1:2379
  - --cacert=/etc/kubernetes/pki/etcd/ca.crt
  - --cert=/etc/kubernetes/pki/etcd/server.crt
  - --key=/etc/kubernetes/pki/etcd/server.key
  - --compress-snapshots
  - --garbage-collection-policy={{ .Values.backup.strategy }}
  # disable delta snapshotting in the cronjob 
  # https://github.com/gardener/etcd-backup-restore/blob/rel-v0.27/chart/etcd-backup-restore/values.yaml#L47
  - --delta-snapshot-period=0 
  - --max-backups={{ .Values.backup.maxBackups }}
  - --storage-provider=S3
  - --store-container={{ .Values.s3.bucket }}
  - --store-prefix={{ .Values.s3.prefix }}
renormalize self-assigned this Apr 8, 2024
@renormalize (Member) commented Apr 8, 2024

Thanks for raising the issue @jeremych1000.

etcd-backup-restore is not designed to exit after a single successful full/delta snapshot, which is what it would need to do to work when run in a Kubernetes CronJob.

To address your other comments:

  • Not putting in a schedule:
    etcd-backup-restore assumes a default schedule if one is not explicitly provided; the default delta-snapshot period is 20s, and the default full-snapshot period is 1 hour.
  • Setting --delta-snapshot-period=0:
    Setting the delta-snapshot period to 0 turns off delta snapshots. Full snapshots are still scheduled to be taken, which is why the process does not exit with code 0.

I don't understand the reasoning behind deploying etcd-backup-restore in a CronJob if the need is a manual trigger of a snapshot - why deploy a CronJob for this instead of a Job? To trigger snapshots on a cron schedule, just use the cron-schedule capabilities of etcd-backup-restore.


To achieve what you need, might I suggest the server subcommand instead of the snapshot subcommand.

  • The server command starts an HTTP server, by default on port 8080, with various paths available at that port. This server performs the same actions you expect from the snapshot command, along with additional features. Docs for the server command can be found here.
    Running the server provides the capability to perform out-of-schedule snapshots.
    To trigger a full snapshot out of schedule, send a POST request to the /snapshot/full path at the exposed port of the server. The same applies for delta snapshots.

  • The server command also initializes the etcd database before the snapshotting flow begins, i.e. it checks whether the database is valid or corrupt; if it is corrupt, it restores from the remote object store. Docs for the initialization command can be found here.


If manual triggering of a full snapshot is all you want, you could simply send a POST request with curl to the endpoint mentioned above.
You could also deploy a Job that does nothing but send that curl request and wait for a successful response before exiting, which gives you the exit behavior you expect from a Job.
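
A minimal sketch of such a trigger, assuming the server is reachable at a hypothetical service name etcd-backup-restore on the default port 8080:

# Hypothetical address; adjust to wherever the backup-restore server is exposed.
# curl returns once the snapshot request completes, so a Job wrapping this
# command exits 0 on success.
curl -X POST http://etcd-backup-restore:8080/snapshot/full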


The Helm charts of etcd-backup-restore are not fully maintained, so not every feature may be available to you through them; the features mentioned above are, however. To make use of all capabilities, I suggest you also take a look at the operator for etcd clusters, gardener/etcd-druid. It deploys etcd-backup-restore as a sidecar container to the etcd cluster nodes and helps with provisioning, maintenance, and deletion.

@renormalize (Member)

Closing this issue since the author has not replied.
Reopen if necessary.

gardener-robot added the status/closed label Apr 10, 2024
@jeremych1000 (Author)

Picking this up again, I'm now trying to configure it in server mode as a separate deployment. Running into two issues.

  1. I can't seem to define the schedule. I've tried the following. It either complains that --schedule isn't a valid flag, or it fails to convert a string to an int. The default schedule happens to suit me, so I could always omit it, but I'm not sure what I'm doing wrong?
--schedule "0 */1 * * *"
--schedule="0 */1 * * *"
--schedule 0 */1 * * *
--schedule=0 */1 * * *
  2. In my server deployment configuration I have the same command-line parameters as the snapshot one, but it fails to start with
time="2024-04-26T16:38:10Z" level=fatal msg="failed to read etcd config file: unable to read etcd config file at path: /var/etcd/config/etcd.conf.yaml : open /var/etcd/config/etcd.conf.yaml: no such file or directory" actor=backup-r
estore-server configFile=/var/etcd/config/etcd.conf.yaml 

I've tried looking into where this comes from - https://github.com/gardener/etcd-backup-restore/blob/master/pkg/miscellaneous/miscellaneous.go#L338C5-L338C139 - but I don't see any docs about requiring an ETCD_CONF env var or mounting this file?

@jeremych1000 (Author)

I can't seem to reopen the issue.

@jeremych1000 (Author)

I additionally see https://github.com/gardener/etcd-backup-restore/blob/master/example/01-etcd-config.yaml - but is this required to run in server mode? It implies it's only for testing purposes?

renormalize reopened this Apr 27, 2024
gardener-robot added the status/accepted label and removed the status/closed label Apr 27, 2024
@renormalize (Member)

@jeremych1000

You'll see that the comment on line 338 mentions
For tests or to run backup-restore server as standalone, user needs to set ETCD_CONF variable with proper location of ETCD config yaml.

// GetConfigFilePath returns the path of the etcd configuration file
func GetConfigFilePath() string {
	// (For testing purpose) If no ETCD_CONF variable set as environment variable, then consider backup-restore server is not used for tests.
	// For tests or to run backup-restore server as standalone, user needs to set ETCD_CONF variable with proper location of ETCD config yaml
	etcdConfigForTest := os.Getenv("ETCD_CONF")

Thus, to run etcd-backup-restore in a standalone fashion (a Deployment in your case), you need to set an environment variable ETCD_CONF pointing to a YAML configuration file that gives etcd-backup-restore information about the etcd cluster it is backing up.
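
For example, a minimal sketch of running the server standalone with that variable set (the file path, bucket, and prefix are placeholders):

# ETCD_CONF must point to a valid etcd config YAML describing the target cluster.
export ETCD_CONF=/var/etcd/config/etcd.conf.yaml
etcdbrctl server \
  --storage-provider=S3 \
  --store-container=<bucket> \
  --store-prefix=<prefix>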

I can understand why there was confusion: the necessity of the ETCD_CONF environment variable for standalone configurations is mentioned here in the source as a comment, but not in the documentation; and the contexts in which it is mentioned in the documentation are about testing only, which is unfortunate.

The documentation around ETCD_CONF should be enhanced to make clear that it must be exported, pointing to a valid file, when running the server command.

The comment in https://github.com/gardener/etcd-backup-restore/blob/master/example/01-etcd-config.yaml should be changed to clarify that the file as-is can be used for testing, and that the template it provides must be followed to run etcd-backup-restore as a server.


I'm not able to replicate the errors you're seeing with the --schedule flag. Could you tell me which release/commit you're running?
If the ETCD_CONF environment variable is exported, the --schedule flag works as expected for me on master. Even when it is not, the program flow should stop due to the missing ETCD_CONF.


The lack of documentation in this case might be due to the fact that etcd-backup-restore is typically used as a sidecar container to an etcd, with both containers (an etcd and an etcd-backup-restore) deployed through the Etcd CR as defined by gardener/etcd-druid. It will be enhanced.

@jeremych1000 (Author)

Thanks. I presume this is used to configure etcd-backup-restore - as in, it needs actual values? How would I go about finding those? I'm on a cluster which is provisioned by someone else.

@renormalize (Member)

Yeah, you'd need the actual values that correspond to the etcd cluster that etcd-backup-restore is acting on.
The primary values you need to configure are initial-cluster (the endpoints of the etcd cluster members), TLS for communication, and the data directory (the Persistent Volume where the etcd DB is stored, which is mounted into etcd-backup-restore).

Since your etcd cluster is provisioned by someone else, you should contact them for information about the etcd cluster. It would be fairly easy to fetch information about the etcd cluster through etcdctl.
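
As a sketch, these are the kinds of etcdctl queries that surface that information (the endpoint and certificate paths are assumptions, mirroring your snapshot config):

# List the members of the existing cluster: names, peer URLs, client URLs.
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table

# Show per-endpoint status, including which member is currently the leader.
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w table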

@jeremych1000 (Author)

Yeah, you'd need the actual values that correspond to the etcd cluster that etcd-backup-restore is acting on. The primary values you need to configure are initial-cluster (the endpoints of the etcd cluster members), TLS for communication, and the data directory (the Persistent Volume where the etcd DB is stored, which is mounted into etcd-backup-restore).

Since your etcd cluster is provisioned by someone else, you should contact them for information about the etcd cluster. It would be fairly easy to fetch information about the etcd cluster through etcdctl.

Thanks - I have access to the etcd pods so will look through the pod definitions and work backwards from there.

Are all the values used, or is there a minimum subset of required values? Is there any documentation on which lines are used for backup, and which for restore purposes?

@renormalize (Member)

If you can exec into a pod, it'll be quite easy to fetch info through etcdctl.
There isn't really any documentation on which lines of this config are used for backup and which for restore.

Such documentation would definitely be useful for consumers of etcd-backup-restore that don't use etcd-druid. If you have any observations about all of this, you're more than welcome to raise a PR to add documentation for it.

Also, I'd say try it out with a single-member etcd first to make it easier for yourself, instead of having to deal with the complexities of a multi-member etcd cluster.

@jeremych1000 (Author)

If you can exec into a pod, it'll be quite easy to fetch info through etcdctl. There isn't really any documentation on which lines of this config are used for backup and which for restore.

Such documentation would definitely be useful for consumers of etcd-backup-restore that don't use etcd-druid. If you have any observations about all of this, you're more than welcome to raise a PR to add documentation for it.

Also, I'd say try it out with a single-member etcd first to make it easier for yourself, instead of having to deal with the complexities of a multi-member etcd cluster.

Thanks - I've got it launching at least with no crashloops (I had to add the configmap, plus POD_NAME and POD_NAMESPACE in the deployment spec). However, I'm still confused as to what the server does.

In the logs of the pod I can see stuff like

time="2024-04-29T13:03:53Z" level=info msg="Starting HTTP server at addr: :8080" actor=backup-restore-server                                                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Etcd is now running. Continuing br startup"                                                                                                                                                 
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=info msg="Attempting to update the member Info: <REDACTED POD NAME>" actor=member-add                                                                                   
time="2024-04-29T13:03:53Z" level=info msg="Updating member peer URL for <REDACTED POD NAME>" actor=member-add                                                                                            
time="2024-04-29T13:03:53Z" level=error msg="failed to update member peer url: could not fetch member URL : could not parse peer URL from the config file : invalid peer URL : http://127.0.0.1:2380" actor=backup-restore-server       
time="2024-04-29T13:03:53Z" level=info msg="Creating leaderElector..." actor=backup-restore-server                                                                                                                                      
time="2024-04-29T13:03:53Z" level=info msg="Starting leaderElection..." actor=leader-elector                                                                                                                                            

I thought it's only supposed to connect to the existing etcd in the cluster, like the snapshot command does? Why is it attempting to start leader election?

@renormalize (Member)

etcd-backup-restore is designed to run as a sidecar container to an etcd container in a single pod. In general, etcd is deployed in HA, by running 3 (or more) members.

This implies that there will be 3 instances of etcd-backup-restore running as sidecars.
We obviously cannot have 3 separate actors trying to back up a single etcd cluster. This is why there is a leader election amongst the instances of etcd-backup-restore.
The leader will always be the etcd-backup-restore container that is the sidecar to the etcd member that is the leader.

The attempt to elect a leader should not stop etcd-backup-restore from backing up snapshots if you have only one instance of it running, as I assume you do.

@jeremych1000 (Author)

etcd-backup-restore is designed to run as a sidecar container to an etcd container in a single pod. In general, etcd is deployed in HA, by running 3 (or more) members.

This implies that there will be 3 instances of etcd-backup-restore running as sidecars. We obviously cannot have 3 separate actors trying to back up a single etcd cluster. This is why there is a leader election amongst the instances of etcd-backup-restore. The leader will always be the etcd-backup-restore container that is the sidecar to the etcd member that is the leader.

The attempt to elect a leader should not stop etcd-backup-restore from backing up snapshots if you have only one instance of it running, as I assume you do.

Thanks. I'm getting close!

I currently have 3 etcd members running, deployed by the cluster operators. etcd-backup-restore is deployed as its own deployment and pod, not as a sidecar (as I don't control how the cluster etcd gets deployed). This works well for the snapshot function.

For the server function I have now gotten it into a state where it's up and stable, and it responds to requests (I can't see a list of URL endpoints anywhere - do you have one? It says 404 not found for most paths).

If I curl /snapshot/full, it returns

time="2024-04-29T14:00:59Z" level=info msg="Fowarding the request to take out-of-schedule full snapshot to backup-restore leader" actor=backup-restore-server                                                                           
time="2024-04-29T14:00:59Z" level=warning msg="Unable to check backup leader health: Get \"https://172.19.112.7:8080/healthz\": dial tcp 172.19.112.7:8080: connect: connection refused" actor=backup-restore-server 

How would I tell etcd-backup-restore that there is only 1 copy of itself running?

@anveshreddy18 (Contributor)

Adding to what @renormalize has said

How would I tell etcd-backup-restore that there is only 1 copy of itself running?

Assuming that you're running the deployment with 1 replica, the etcd config should be set along the lines of the example config used for testing:

name: etcd
data-dir: "default.etcd"
metrics: extensive
snapshot-count: 75000
enable-v2: false
quota-backend-bytes: 8589934592 # 8Gi
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://0.0.0.0:2379
initial-advertise-peer-urls: http://0.0.0.0:2380
initial-cluster: etcd=http://0.0.0.0:2380
initial-cluster-token: new
initial-cluster-state: new
auto-compaction-mode: periodic
auto-compaction-retention: 30m

This was recently added to the repo in this PR to make it easy to test backup-restore with any running etcd process. Please make sure you have not added any extra configuration beyond the minimum required values mentioned, for initial testing.

@jeremych1000 (Author)

I'm running HA etcd with 3 replicas. I've tried like 10 different combinations of ports and service names - so close, yet so far. I also tried initial-cluster-state: existing, as I don't want etcd-backup-restore to spin up its own etcd.

With most configs (such as the one above) it errors with the below. Do I have to hardcode the actual etcd pod names in the initial-cluster key, maybe?

time="2024-04-29T15:59:52Z" level=info msg="etcd-backup-restore Version: v0.28.0"                                                                                                                                                       
time="2024-04-29T15:59:52Z" level=info msg="Git SHA: 727e957b"                                                                                                                                                                          
time="2024-04-29T15:59:52Z" level=info msg="Go Version: go1.20.3"                                                                                                                                                                       
time="2024-04-29T15:59:52Z" level=info msg="Go OS/Arch: linux/amd64"                                                                                                                                                                    
time="2024-04-29T15:59:52Z" level=info msg="compressionConfig:\n  enabled: true\n  policy: gzip\ndefragmentationSchedule: 0 0 */3 * *\netcdConnectionConfig:\n  caFile: /etc/kubernetes/pki/etcd/ca.crt\n  certFile: /etc/kubernetes/pki
time="2024-04-29T15:59:52Z" level=info msg="Setting status to : 503" actor=backup-restore-server                                                                                                                                        
time="2024-04-29T15:59:52Z" level=info msg="Registering the http request handlers..." actor=backup-restore-server                                                                                                                       
time="2024-04-29T15:59:52Z" level=info msg="Starting the http server..." actor=backup-restore-server                                                                                                                                    
time="2024-04-29T15:59:52Z" level=info msg="Checking if etcd is running"                                                                                                                                                                
time="2024-04-29T15:59:52Z" level=info msg="Starting HTTP server at addr: :8080" actor=backup-restore-server                                                                                                                            
time="2024-04-29T15:59:52Z" level=info msg="Etcd is now running. Continuing br startup" 
time="2024-04-29T15:59:52Z" level=info msg="Updating member peer URL for etcd-backups-server-6d8d7bd5b6-rtkfn" actor=member-add                                                                                            
time="2024-04-29T15:59:52Z" level=error msg="failed to update member peer url: could not fetch member URL : could not parse peer URL from the config file : invalid peer URL : http://0.0.0.0:2380" actor=backup-restore-server 

@renormalize (Member)

All endpoints exposed by etcd-backup-restore:

mux.HandleFunc("/initialization/start", h.serveInitialize)
mux.HandleFunc("/initialization/status", h.serveInitializationStatus)
mux.HandleFunc("/snapshot/full", h.serveFullSnapshotTrigger)
mux.HandleFunc("/snapshot/delta", h.serveDeltaSnapshotTrigger)
mux.HandleFunc("/snapshot/latest", h.serveLatestSnapshotMetadata)
mux.HandleFunc("/config", h.serveConfig)
mux.HandleFunc("/healthz", h.serveHealthz)
mux.Handle("/metrics", promhttp.Handler())

Seems like we've hit a wall here. There is no way we can tell etcd-backup-restore that it's the only replica running and it should be the leader: etcd-backup-restore couples itself tightly to the etcd running alongside.

etcd-backup-restore checks whether the etcd at the endpoint provided to it in the configuration (the --endpoints flag, i.e. the client URL of the etcd running alongside) is the leader; if it is, that instance becomes the leader in the etcd-backup-restore "cluster" and proceeds to perform the functionality you want from etcd-backup-restore.

Now, to make your single etcd-backup-restore replica the leader, you must somehow provide it the client endpoint of the leading etcd member of the HA etcd cluster, so that this singular etcd-backup-restore member considers itself the leader.

How could that be done? I'm at a loss. The fact that etcd-backup-restore by design runs as a single member with a single-member etcd, or as a 3-member cluster with a 3-member etcd, is what is causing this issue.

Each etcd-backup-restore replica relies on its accompanying etcd for the privilege to take snapshots.

@jeremych1000 (Author)

Update - I got it working if I force the server pod onto the same node as the currently elected etcd leader. If it was on any other control plane node, it didn't work.

A follow-up question then - the POST request to take a snapshot worked, and I got a JSON payload back. However, the server itself was still in its loop of taking delta and full snapshots, even though I didn't define a schedule?

How can I disable this? I was under the impression using the server keyword would stop any scheduled backups.

@renormalize (Member) commented Apr 29, 2024

The server is an enhancement of the snapshot command as I've explained in detail in #725 (comment). It is designed to always take full snapshots. It lets an operator perform out-of-schedule snapshots.

As explained in the documentation pointed to by the above linked comment:

Etcdbrctl server
With sub-command server you can start a http server which exposes an endpoint to initialize etcd over REST interface. The server also keeps the backup schedule thread running to keep taking periodic backups. This is mainly made available to manage an etcd instance running in a Kubernetes cluster. You can deploy the example helm chart on a Kubernetes cluster to have a fault-resilient, self-healing etcd cluster.

The server also keeps the backup schedule thread running to keep taking periodic backups.

If you don't want delta snapshots, just set --delta-snapshot-period to less than 1.

You can't really disable full snapshots. Set the schedule to a really long period so it doesn't bother you?
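
For example, a sketch - the yearly cron expression here is just an illustration of a "really long period":

# Delta snapshots off; full snapshots scheduled only once a year (midnight, Jan 1).
etcdbrctl server --delta-snapshot-period=0 --schedule="0 0 1 1 *" \
  --storage-provider=<PROVIDER> --store-container=<CONTAINER> --store-prefix=<YOUR_PREFIX>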

@renormalize (Member)

Anything else @jeremych1000?

@jeremych1000 (Author) commented Apr 29, 2024

Anything else @jeremych1000?

Thank you very much for the responsiveness!

One final thing. Can I confirm there needs to be a 1-to-1 mapping (i.e. I can't use initial-cluster to define the IPs of all the etcd replicas)? So if I have 3 etcd members, I will need to run 3 copies of backup-restore, each of which is colocated on the same node as an etcd replica.

As long as at least 1 replica succeeds in taking backups, I'm happy!

@renormalize (Member)

You're right that there needs to be a 1-1 mapping between an etcd member and an etcd-backup-restore pod.

You run 3 replicas of etcd-backup-restore, each colocated on the same node as a replica of the etcd cluster, and the endpoint passed to each etcd-backup-restore is the endpoint of the etcd member it is colocated with. This is exactly what etcd-backup-restore does while running as a sidecar, albeit more simply, since the endpoint for the colocated etcd member is just localhost.

Once you maintain these three replicas with a 1:1 mapping, there will always be one etcd-backup-restore that takes snapshots.
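
A rough sketch of the per-replica invocation under that 1:1 scheme (the member hostname is a placeholder; the other flags mirror your existing config):

# Run on the node hosting etcd member N; point backup-restore only at that member.
etcdbrctl server \
  --endpoints=https://<colocated-etcd-member>:2379 \
  --storage-provider=S3 \
  --store-container=<bucket> \
  --store-prefix=<prefix>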

Glad we've figured a solution out for you!

@renormalize (Member)

I'd appreciate it if you could draft a PR enhancing the docs, if the way you're using etcd-backup-restore works as expected for you, since they're currently missing.

The docs for using etcd-backup-restore standalone, by a consumer with an already existing etcd cluster that they cannot touch, are unfortunately lacking. This would help make etcd-backup-restore more approachable as an option!

@jeremych1000 (Author)

Thanks, will do.

In terms of the schedule flag,

Error: unknown flag: --schedule "0 */1 * * *"                                                                                                                                                                                           
Usage:                                                                                                                                                                                                                                  
  etcdbrctl server [flags]                                                                                                                                                                                                              
                                                                                                                                                                                                                                        
Flags:                                                                                                                                                                                                                                  
unknown flag: --schedule "0 */1 * * *"     

This is my helm template:

    spec:
      containers:
        - name: {{ .Release.Name }}
          image: {{ .Values.image.repo }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
          args:
            - server
            - <...>
            # TODO - figure out how this works; I've tried adding quotes, using | quote, not using it, etc.
            # it defaults to "0 */1 * * *" which will work for now
            - --schedule {{ .Values.backup.schedule | quote }}

values.yaml

backup:
  strategy: Exponential
  maxBackups: 10 # ONLY USED IF STRATEGY == LIMITBASED
  schedule: "0 */1 * * *" # every hour

@jeremych1000 (Author)

I'm running gardener-project/public/gardener/etcdbrctl:v0.28.0.

@jeremych1000 (Author)

If I use -s {{ .Values.backup.schedule | quote }} instead of --schedule it says

time="2024-04-30T12:43:13Z" level=info msg="etcd-backup-restore Version: v0.28.0"                                                                                                                                                       
time="2024-04-30T12:43:13Z" level=info msg="Git SHA: 727e957b"                                                                                                                                                                          
time="2024-04-30T12:43:13Z" level=info msg="Go Version: go1.20.3"                                                                                                                                                                       
time="2024-04-30T12:43:13Z" level=info msg="Go OS/Arch: linux/amd64"                                                                                                                                                                    
time="2024-04-30T12:43:13Z" level=fatal msg="failed to validate the options: failed to parse int from \"0: strconv.Atoi: parsing \"\\\"0\": invalid syntax" 

@renormalize (Member)

I'm not able to figure out what the reason could be from the logs you've shared. When I run etcdbrctl with a custom schedule (one full snapshot a day) with delta snapshots disabled, I simply run:

./bin/etcdbrctl server --storage-provider=<PROVIDER> --store-container=<CONTAINER> --store-prefix=<YOUR_PREFIX> --schedule="0 0 * * *" --delta-snapshot-period=0

The only difference I see between your Helm chart and the chart that used to be maintained for etcd-backup-restore is that yours is missing the equals sign. Maybe give that a shot?

- name: backup-restore
  command:
  - etcdbrctl
  - server
  - --schedule={{ .Values.backup.schedule }}

@jeremych1000 (Author)

equal-to sign

Thanks, I could've sworn I tried that before, but it now works - I also removed | quote.

I've got the end to end flow working as well. A bit janky maybe, but it works.

What I've done:

  • a daemonset with tolerations for control-plane nodes (this guarantees a pod on any etcd node)
  • the readiness probe is set to port 2379 (so replicas on control-plane nodes that don't run etcd will always be 0/1 ready)
  • the daemonset runs the server command, with a schedule we define that takes backups every hour
  • the daemonset is also fronted by a headless service, i.e. nslookups return the IPs of all active etcd-backup-restore replicas
  • as a fallback, we have a cronjob that runs every day
  • this cronjob runs a python script
  • the script runs an nslookup against the headless service and gets the IPs of ready replicas
  • it then loops through them, sending a POST to /snapshot/full (sketched in shell below)
  • if it 405's, that means that etcd wasn't the master
  • when it hits the master etcd, it 200's and returns a successful snapshot result
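
The loop the script performs, sketched here in shell under assumed names (a headless service etcd-backup-restore-headless resolving to the ready replicas, server port 8080):

#!/usr/bin/env bash
# Resolve the ready replicas behind the (assumed) headless service, then try
# each until the leader accepts the full-snapshot trigger with HTTP 200.
for ip in $(getent hosts etcd-backup-restore-headless | awk '{print $1}'); do
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "http://${ip}:8080/snapshot/full")
  if [ "${code}" = "200" ]; then
    echo "full snapshot triggered via ${ip}"
    exit 0
  fi
done
echo "no replica accepted the snapshot request" >&2
exit 1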

I couldn't find a way to make the etcd-backup-restore replicas aware of / talk to each other (I would've loved to send one POST request to any replica and have it forwarded to the etcd-backup-restore leader). I saw the initial-cluster config in etcd-config.yaml, but I can't hardcode node names in my helm chart as they are not known ahead of time.

Happy to improve the docs, but my use case seems a bit niche/hacky. Which bits would you like me to document further? Perhaps the requirement for the 1-1 mapping?

@renormalize (Member)

Thanks for summarizing your setup for using etcd-backup-restore; it gives us insight into what people interested in etcd-backup-restore would like to use it for.

There are parts that could be generalized, but the maintainers will probably take that up some other time instead of dedicating time and effort to it right now.

I'm sure there's a way to make all the etcd-backup-restore pods aware of each other. I'll look into it when I get time.

@renormalize (Member) commented May 2, 2024

@jeremych1000 the maintainers would like to discuss the pluggability of etcd-backup-restore with you, to enhance it in the future.

Would you be okay with a call?
Our timezone is Indian Standard Time. Looking forward to your reply. If you're up for a call, we'll share a meeting link over email.

@jeremych1000 (Author)

@jeremych1000 the maintainers would like to discuss the pluggability of etcd-backup-restore with you, to enhance it in the future.

Would you be okay with a call? Our timezone is Indian Standard Time. Looking forward to your reply. If you're up for a call, we'll share a meeting link over email.

Hello, that would be useful. I'm in the UK - happy to discuss the agenda and meeting times over email.

@jeremych1000 (Author)

@renormalize quick question - with the server command, how would I configure which port the server listens on? I now want to run two copies of etcd-backup-restore for every etcd member, as we have 2 locations to back up to for redundancy purposes.

I can't specify two buckets, nor can I specify which port the server spins up on (default 8080). With hostNetwork: true this means the second pod errors out:

"Failed to start http server: listen tcp :8080: bind: address already in use"

@renormalize (Member)

Run server with the --help flag:

➜  etcd-backup-restore git:(master) ./bin/etcdbrctl server --help
Server will keep listening for http request to deliver its functionality through http endpoints.

Usage:
  etcdbrctl server [flags]

Flags:
      --auto-compaction-mode string                        mode for auto-compaction: 'periodic' for duration based retention. 'revision' for revision number based retention. (default "periodic")
...
...
  -p, --server-port uint                                   port on which server should listen (default 8080)

Backing up to two buckets simultaneously is not supported.
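
Given that constraint, one instance per bucket with distinct ports is the workaround; a sketch with placeholder values:

# First copy: bucket A on the default port 8080.
etcdbrctl server --storage-provider=S3 --store-container=<bucket-a> --store-prefix=<prefix>

# Second copy: bucket B, moved off 8080 with --server-port to avoid the bind error.
etcdbrctl server --storage-provider=S3 --store-container=<bucket-b> --store-prefix=<prefix> --server-port=8081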

@unmarshall (Contributor)

@jeremych1000 let's discuss your requirements. We are overhauling etcd-backup-restore and would very much like to understand all of your use cases. This will help us in designing the next version. Can you please prepare the following:

  1. Difficulties faced consuming etcd-backup-restore
  2. Features that are currently missing, with a use case for each so that we can better understand them

@renormalize can you please schedule a meeting.
