
[BUG] Etcdbrctl restore throws FATAL - results in event revision 1 higher than expected snapshot revision verification failed for delta snapshot #583

Open
Starblade42 opened this issue Feb 7, 2023 · 3 comments
Labels
kind/bug (Bug), lifecycle/stale (Nobody worked on this for 6 months (will further age))

Comments


Starblade42 commented Feb 7, 2023

Etcdbrctl snapshot revision verification failed for delta snapshot

Describe the bug:
When doing a restore from a full backup and incremental backups, we encountered a failure mid-restore:

FATA[0156] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002039293-13002040428-1675782244 : mismatched event revision while applying delta snapshot, expected 13002040428 but applied 13002040429

This is after restoring ~15 of ~30 incremental backups.

It actually occurred at different places across multiple attempts to restore:

FATA[0159] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002040429-13002041241-1675782247 : mismatched event revision while applying delta snapshot, expected 13002041241 but applied 13002041242
FATA[0157] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002041242-13002042199-1675782251 : mismatched event revision while applying delta snapshot, expected 13002042199 but applied 13002042200
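For reference, here is a minimal Go sketch of the kind of strict check that would produce this error, assuming (purely hypothetically) that the restorer tracks the revision of the last applied event and compares it against the end revision encoded in the delta snapshot name. The type and function names below are illustrative, not etcd-backup-restore's actual API:

```go
package main

import "fmt"

// deltaSnapshot is a hypothetical stand-in for an incremental backup whose
// name encodes its revision range, e.g. Incr-<startRev>-<endRev>-<timestamp>.
type deltaSnapshot struct {
	Name          string
	StartRevision int64
	EndRevision   int64
}

// verifyAppliedRevision mirrors the strict verification implied by the error
// text: after replaying a delta snapshot's events, the last applied event
// revision must equal the snapshot's end revision exactly, otherwise the
// restore is aborted.
func verifyAppliedRevision(s deltaSnapshot, lastAppliedRevision int64) error {
	if lastAppliedRevision != s.EndRevision {
		return fmt.Errorf(
			"snapshot revision verification failed for delta snapshot %s : "+
				"mismatched event revision while applying delta snapshot, expected %d but applied %d",
			s.Name, s.EndRevision, lastAppliedRevision)
	}
	return nil
}

func main() {
	s := deltaSnapshot{
		Name:          "Incr-13002040429-13002041241-1675782247",
		StartRevision: 13002040429,
		EndRevision:   13002041241,
	}
	// Off by one, as in the failed restore attempts above.
	if err := verifyAppliedRevision(s, 13002041242); err != nil {
		fmt.Println(err)
	}
}
```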

  • Full backup schedule: */30 * * * *
  • Incremental backup period: 60s
  • Storage retention policy: Exponential

Expected behavior:

I'd expect either the restoration to complete at the same event revision, or, if it's acceptable for the restored revision to be slightly higher, for the tool to log that at info or warn level and continue.

Ultimately, I'd expect the restoration to work from these incremental backups.

The main question is: is it acceptable for the event revision of the etcd process used for the restore to end up slightly higher than the delta snapshot's expected revision? Is that actually a fatal error, or something to be expected when restoring a 6.7 GB full backup plus ~15 or so 10 MB incremental snapshots?

If it's expected, can we info/warn and continue, or is this actually fatal?
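To illustrate the question, here is a sketch of what a tolerant variant of the check from the earlier snippet could look like if a slightly higher applied revision were indeed harmless: it warns and continues on a small overshoot instead of aborting. This reuses the hypothetical deltaSnapshot type from above and is only an illustration of the behaviour being asked about, not a proposal grounded in the tool's internals:

```go
// verifyAppliedRevisionTolerant would warn and continue when the applied
// revision overshoots the expected end revision by at most `tolerance`
// revisions, and still fail hard on any other mismatch. Illustrative only;
// it reuses the hypothetical deltaSnapshot type from the sketch above.
func verifyAppliedRevisionTolerant(s deltaSnapshot, lastAppliedRevision, tolerance int64,
	warnf func(format string, args ...interface{})) error {
	diff := lastAppliedRevision - s.EndRevision
	switch {
	case diff == 0:
		return nil
	case diff > 0 && diff <= tolerance:
		warnf("delta snapshot %s: applied revision %d exceeds expected %d by %d, continuing",
			s.Name, lastAppliedRevision, s.EndRevision, diff)
		return nil
	default:
		return fmt.Errorf(
			"snapshot revision verification failed for delta snapshot %s : expected %d but applied %d",
			s.Name, s.EndRevision, lastAppliedRevision)
	}
}
```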

How To Reproduce (as minimally and precisely as possible):

This is the first time I've seen this. It happened on an etcd cluster with 3 members and ~750 namespaces, with a typical full snapshot size of 6 GB and incremental snapshots between 10 MB and 20 MB.

After taking etcd offline, deleting the member and WAL directories, and restoring, somewhere in the middle of the incremental backups the restore ends up with an event revision that is 1 higher than expected.

Logs:

INFO[0157] Applying delta snapshot Incr-13002037451-13002038085-1675782235  actor=restorer
INFO[0157] Applying delta snapshot Incr-13002038086-13002039292-1675782240  actor=restorer
INFO[0157] Applying delta snapshot Incr-13002039293-13002040428-1675782244  actor=restorer
INFO[0158] Applying delta snapshot Incr-13002040429-13002041241-1675782247  actor=restorer
INFO[0158] Cleanup complete                              actor=restorer
ERRO[0158] Restoration failed.                           actor=restorer
{"level":"info","ts":"2023-02-07T16:30:50.176Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/mnt/ramdisk/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:558","msg":"stopping serving peer traffic","address":"127.0.0.1:11233"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:11233"}
{"level":"info","ts":"2023-02-07T16:30:50.316Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/mnt/ramdisk/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
FATA[0159] Failed to restore snapshot: snapshot revision verification failed for delta snapshot Incr-13002040429-13002041241-1675782247 : mismatched event revision while applying delta snapshot, expected 13002041241 but applied 13002041242

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID:
etcdctl version: 3.5.0
API version: 3.5
  • Etcd-backup-restore version/commit ID:
INFO[0000] etcd-backup-restore Version: v0.20.1
INFO[0000] Git SHA: 0893f648
INFO[0000] Go Version: go1.18.6
INFO[0000] Go OS/Arch: linux/amd64
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:
    Azure

Anything else we need to know?:

We mostly use etcdbrctl just as a backup agent alongside a somewhat hand-rolled etcd quorum managed by systemd. We've been using it successfully for several years, but wanted to improve our mean time to recovery with a snapshotting and restore tool.

I really wanted to use it in server mode, but that would have required completely refactoring or even rewriting our etcd bootstrap to harmonize with the server. So it runs as a systemd unit that takes backups, and when a restore is needed we run the restore ourselves, after our Prometheus alerts tell us the quorum is broken.

Starblade42 added the kind/bug label on Feb 7, 2023
ishan16696 (Member) commented:

Right now we don't have much information about this scenario either, as it seems to be a very rare case. I will look into what the issue is here.


danV111 commented Mar 30, 2023

Hello! Are there any updates on this?

Starblade42 (Author) commented:

In our case, we used a workaround: we use the copy function of etcdbrctl to fetch the last full backup, and then we restore from that full backup alone, which causes no issues. This isn't ideal, but it doesn't typically have a huge impact, especially when there are no real alternatives once we've completely lost etcd quorum. We only use this method when the usual full snapshot restore plus all incremental restores fails.

This issue does not occur every time, but it often occurs on our largest k8s clusters, which run the largest number of objects, with the following settings:

  • Full backups every 30m
  • Incremental backups every 1m

We end up with typical delta backups of ~10 MB.

We usually see this issue somewhere around the ~15th delta backup applied on top of the full backup, and the applied revision is usually off by about 1 etcd commit.

If we use more frequent settings (for example, 20-minute full backups and 30-second incrementals), this issue doesn't seem to occur.

ishan16696 removed their assignment on Aug 22, 2023
gardener-robot added the lifecycle/stale label on Apr 30, 2024