doc(enhancement): add recurring and manual full backup support

ref: longhorn/longhorn 7070

Signed-off-by: Jack Lin <jack.lin@suse.com>
longhorn · Mar 14, 2024 · ecff127 · ecff127
1 parent 42c0757
commit ecff127
Showing 1 changed file with 124 additions and 0 deletions.
diff --git a/enhancements/20240314-recurring-and-manual-full-backup-support.md b/enhancements/20240314-recurring-and-manual-full-backup-support.md
@@ -0,0 +1,124 @@
+# 20240314-recurring-and-manual-full-backup-support
+
+## Summary
+
+This feature enables Longhorn to perform a full backup of a volume either **periodically**, through a new recurring job task `Full Backup`, or **manually**, through the backup creation API.
+
+### Related Issues
+
+- Community issue: https://github.com/longhorn/longhorn/issues/7069
+- Improvement issue: https://github.com/longhorn/longhorn/issues/7070
+
+## Motivation
+
+Longhorn always performs incremental backups, which back up only the blocks that changed since the previous backup.
+If previously backed-up blocks on the backupstore become corrupted, users can no longer restore the volume, because Longhorn aborts the restoration when it finds that a block's checksum does not match. Because an incremental backup never re-uploads unchanged blocks, later backups do not repair the corruption.
+
+### Goals
+
+- Add a new task type `Full Backup` that periodically performs a full backup of the Volume.
+- Add a new API that allows users to manually perform a full backup of the Volume.
+- When performing a full backup, Longhorn backs up **all the current blocks** of the volume and **overwrites them** on the backupstore, even if those blocks already exist there.
+
+
+## Proposal
+
+### User Stories
+
+Users can create a RecurringJob with `spec.task=full-backup` and associate it with volumes.
+
+### User Experience In Detail
+
+#### Recurring Full Backup
+
+1. Create a RecurringJob with the `full-backup` task type and assign it to the volume:
+```yaml
+apiVersion: longhorn.io/v1beta2
+kind: RecurringJob
+metadata:
+  name: recurring-full-backup-per-min
+  namespace: longhorn-system
+spec:
+  concurrency: 1
+  cron: '* * * * *'
+  groups: []
+  labels: {}
+  name: recurring-full-backup-per-min
+  retain: 0
+  task: full-backup
+```
+2. The RecurringJob runs and performs a full backup of the volume.
+
+#### Manual Full Backup
+
+1. When creating a backup, the user can select the `full backup` checkbox.
+2. The UI then automatically adds the label `longhorn-backup-mode: full` to the backup creation request.
+3. The resulting backup is a full backup.
+
+## Design
+
+### Implementation Overview
+
+#### UI
+
+1. When creating a backup, the user can select the `full backup` checkbox.
+2. The UI then automatically adds the label `longhorn-backup-mode: full` to the backup creation request.
+3. The resulting backup is a full backup.
+
+
+#### CRD
+
+1. **RecurringJob**: add a new `RecurringJobType: full-backup`.
+2. **Backup**: add a new reserved Longhorn label `longhorn-backup-mode: full`.
+    - When the Backup CR has this label, Longhorn performs a full backup.
+    - Using a label lets us filter Backups by it and easily distinguish full backups from normal backups.
+    - Labels are already stored on the backupstore during a backup. Thus, when the Backup is pulled from the backupstore into a new cluster, the label is pulled as well.
+    - The label is only used while the backup is running, to tell the engine/replica to perform a full backup.
+    - `Label` is already passed through the gRPC call chain `longhorn-manager` -> `longhorn-instance-manager` -> `longhorn-engine/replica` -> `backupstore`, so little modification is needed.
+
+Backup CR Example
+```yaml
+apiVersion: longhorn.io/v1beta2
+kind: Backup
+metadata:
+  name: backup-abcde1234
+  namespace: longhorn-system
+spec:
+  snapshot: fake-snapshot
+  labels:
+    longhorn-backup-mode: full
+```
+
+#### Backupstore
+0. In the current implementation, if the Volume has a `lastBackup`, Longhorn always performs an incremental backup.
+1. With this change, if the label `longhorn-backup-mode: full` is present, Longhorn:
+    - pretends the last Backup does not exist, which forces a full backup.
+    - overwrites each block on the backupstore even if it already exists.
+
+### Test plan
+
+1. Create a 4 MB Volume and fill it with data.
+2. Create a Backup of the Volume.
+3. Intentionally replace the content of the first block (2 MB) on the backupstore.
+4. Restore the Volume. The restoration fails with error logs like the following:
+    ```
+    [pvc-XXXXXX] time="XXXX" level=error msg="Backup data restore Error Found in Server[gzip: invalid checksum]"
+    ```
+5. Create a full backup by manually triggering the API (use the UI if it is ready):
+    ```
+    curl -u "${RANCHER_ACCESS_KEY}:${RANCHER_SECRET_KEY}" \
+    -X POST \
+    -H 'Accept: application/json' \
+    -H 'Content-Type: application/json' \
+    -d '{"name":"axaxaxaxa", "isFull":true}' \
+    "http://localhost:8080/v1/volumes/${volName}?action=snapshotBackup"
+    ```
+6. Restore the backup. This time the restoration should succeed.
+
+### Upgrade strategy
+
+No upgrade strategy is needed.
+
+## Note [optional]
+
+None.
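
The Backupstore rule proposed in the design above (ignore `lastBackup` and overwrite existing blocks when `longhorn-backup-mode: full` is set) can be illustrated with a minimal in-memory sketch. This is not the actual backupstore code: the function names, the dict-based store, and the use of SHA-256 as the block checksum are all assumptions made for illustration. Only the label key and value come from the proposal.

```python
import hashlib
from typing import Dict, List, Optional

# Label key/value taken from the proposal; everything else here is hypothetical.
BACKUP_MODE_LABEL = "longhorn-backup-mode"
BACKUP_MODE_FULL = "full"


def checksum(data: bytes) -> str:
    """Block identifier. SHA-256 is an assumption, not necessarily what Longhorn uses."""
    return hashlib.sha256(data).hexdigest()


def is_full_backup(labels: Dict[str, str]) -> bool:
    """A backup is full when the reserved label is present with the value 'full'."""
    return labels.get(BACKUP_MODE_LABEL) == BACKUP_MODE_FULL


def backup(
    volume_blocks: List[bytes],
    store: Dict[str, bytes],
    labels: Dict[str, str],
    last_backup: Optional[List[str]],
) -> List[str]:
    """Back up a volume into `store` (checksum -> block data).

    `last_backup` is the list of block checksums recorded by the previous
    backup, or None if there is none. Returns the checksum list of this backup.
    """
    full = is_full_backup(labels)
    if full:
        # Pretend the last backup does not exist, forcing every block to be processed.
        last_backup = None

    new_backup: List[str] = []
    for index, data in enumerate(volume_blocks):
        key = checksum(data)
        new_backup.append(key)

        unchanged = (
            last_backup is not None
            and index < len(last_backup)
            and last_backup[index] == key
        )
        if unchanged:
            continue  # incremental: block did not change since the last backup
        if key in store and not full:
            continue  # normal mode: never re-upload a block that already exists
        store[key] = data  # full mode overwrites even if the block already exists

    return new_backup
```

Mirroring the test plan: after a first backup, corrupting a stored block and running another incremental backup leaves the corruption in place, while a backup carrying `{"longhorn-backup-mode": "full"}` rewrites the block with the volume's current data.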