feat(lep): add recurring and manual full backup support #8186

Open
wants to merge 1 commit into
base: master
211 changes: 211 additions & 0 deletions enhancements/20240314-recurring-and-manual-full-backup-support.md
@@ -0,0 +1,211 @@
# 20240314-recurring-and-manual-full-backup-support

## Summary

This feature lets Longhorn run a **recurring job** that performs full backups and lets users **manually trigger** a full backup of a volume.

### Related Issues

- Community issue: https://github.com/longhorn/longhorn/issues/7069
- Improvement issue: https://github.com/longhorn/longhorn/issues/7070

## Motivation

Longhorn always performs incremental backups, which upload only the newly updated blocks.
If blocks from a previous backup are corrupted on the backupstore, users can no longer restore the volume, because Longhorn aborts the restoration when it finds blocks whose checksums do not match.

### Goals

- Add a new field `Parameters` to `RecurringJob` and `Backup`
  - `backup-mode`: used in the `Backup` CR to trigger a full backup (options: `"full"`, `"incremental"`; defaults to `"incremental"`, i.e. always incremental)
  - `full-backup-interval`: used in `RecurringJob - Backup Type` to execute a full backup every N incremental backups (defaults to 0, i.e. always incremental)
Contributor

NIT: It's better to provide an example in the description: If N is 5, it means that after 5 regular incremental backups, the 6th backup will be the full one.

- When doing a full backup, Longhorn backs up **all the current blocks** of the volume and **overwrites them** on the backupstore, even if those blocks already exist there.
- Collect metrics for `newly upload data size` and `overwritten data size` so users can better understand the cost.
Contributor

Can Longhorn collect this info if it's not recorded in Backup CR?


## Proposal

### User Stories

### User Experience In Detail

#### Recurring Full Backup - Always

1. Create a `Backup` task type RecurringJob with the parameter `full-backup-interval: 0` and assign it to the volume
```
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: recurring-full-backup-per-min
  namespace: longhorn-system
spec:
  concurrency: 1
  cron: '* * * * *'
  groups: []
  labels: {}
  parameters:
    full-backup-interval: 0
  name: recurring-full-backup-per-min
  retain: 0
  task: backup
```
2. The RecurringJob runs and performs a full backup of the volume every time.

#### Recurring Full Backup - Every N Incremental Backups

1. Create a `Backup` task type RecurringJob with the parameter `full-backup-interval: 5` and assign it to the volume
```
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: recurring-full-backup-per-min
  namespace: longhorn-system
spec:
  concurrency: 1
  cron: '* * * * *'
  groups: []
  labels: {}
  parameters:
    full-backup-interval: 5
  name: recurring-full-backup-per-min
  retain: 0
  task: backup
```
2. The RecurringJob runs and performs a full backup of the volume every 5 incremental backups.

#### Manual Full Backup

1. When creating a backup, users can check the checkbox `Full Backup: []` or add the parameter `backup-mode: full` to the spec.
2. The backup will be a full backup.
3. The UI may be adjusted to make the process simpler.

## Design

### Implementation Overview

#### Metrics

1. Add two new fields, `new upload data size` and `reupload data size`, to the Backup Status

#### UI

1. In **Volume Page** >> **Create Backup**, add a checkbox `Full Backup: []`
   - If it is checked, automatically add the parameter `backup-mode: full` to the request payload
   - For example:
```
POST /v1/volumes/${VOLUME_NAME}?action=snapshotBackup HTTP/1.1
Host: localhost:8080
Accept: application/json
Content-Type: application/json

{
  "parameters": {
    "backup-mode": "full"
  },
  "name": "${BACKUP_NAME}"
}
```

2. In **Recurring Job** >> **Create Recurring Job**, add a new section for users to fill in the parameters when the task is a `Backup`-related task.
   - Currently the only supported parameter is:
     - `full-backup-interval`

3. In **Backup** >> **${BackupVolume}**, add a new field `Backup Mode`
   - If the backup has the parameter `backup-mode: full`, show `full`; otherwise show `incremental`
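
As a rough illustration of the request described in item 1, the payload could be assembled as shown below. This is only a sketch: `build_snapshot_backup_request` is a hypothetical helper, and the path and field names are copied from the example request in this proposal rather than verified against a running Longhorn manager.

```python
import json


def build_snapshot_backup_request(volume_name, backup_name, full_backup=False):
    """Return (path, body) for the snapshotBackup action shown in this proposal.

    Hypothetical helper: the path and field names mirror the example request
    in this document and are not checked against the real Longhorn API.
    """
    payload = {"name": backup_name}
    if full_backup:
        # Only sent when the "Full Backup" checkbox is ticked; leaving it out
        # keeps the default incremental behaviour.
        payload["parameters"] = {"backup-mode": "full"}
    path = f"/v1/volumes/{volume_name}?action=snapshotBackup"
    return path, json.dumps(payload)


path, body = build_snapshot_backup_request("vol-1", "snap-1", full_backup=True)
print(path)
print(body)
```

Leaving `parameters` out entirely when the checkbox is unchecked keeps the request identical to what existing clients already send.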

#### CRD

1. **BackupVolume**: add a new status `.Status.BackupCount` to record how many backups have been created.
Member

What's the purpose of this new status field?

Contributor Author

We allow users to set the interval of full backups within a series of incremental backups,
so we need to know the current backup count.

Member

Does backup time mean the number of full backup? When will it be reset?

Contributor Author

It starts when the BackupVolume is created and is stored on the backupstore.
It means how many times this Volume has been backed up, regardless of whether each backup was full or incremental.

The parameter is `full-backup-interval: 5` in the recurring backup job,
so a full backup is done once every 5 backups.
It uses the backupCount in BackupVolume to check.

Member
@innobead May 28, 2024

I know the purpose, but not sure why we want to record it. I feel it's for metrics, but at least it is not explained in this PR.


2. **Backup**: add a new field `parameters` to pass the backup options.
   - `backup-mode`: `"full"` triggers a full backup. Defaults to `"incremental"` for an incremental backup

3. **RecurringJob**: add a new field `parameters`.
   - `full-backup-interval`: only used in `Backup`-related tasks. Executes a full backup every N incremental backups. Defaults to 0 for always incremental

Backup CR Example
```yaml
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  name: backup-abcde1234
  namespace: longhorn-system
spec:
  snapshot: fake-snapshot
  parameters:
    backup-mode: full
```

Member

Can we just have a spec.mode instead of having another parameters? It seems a redundant layer? The default value is incremental.
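
One possible reading of how `full-backup-interval` and `.Status.BackupCount` combine is sketched below. It assumes `backup_count` is the number of backups already taken for the BackupVolume and that a full backup runs whenever that count is a multiple of the interval, which follows the author's reply in the review thread. The exact counting rule (for example, whether N = 5 makes the 5th or the 6th backup full) is not settled in this proposal, and `should_run_full_backup` is a hypothetical name.

```python
def should_run_full_backup(backup_count: int, interval: int) -> bool:
    """Decide whether the next recurring backup should be a full backup.

    Assumptions (not confirmed by the proposal):
      - backup_count is the number of backups already taken for the volume.
      - interval == 0 means "always incremental" (the documented default).
      - otherwise the next backup is full when backup_count is a multiple
        of interval.
    """
    if interval < 0:
        raise ValueError("full-backup-interval must be >= 0")
    if interval == 0:
        return False
    return backup_count % interval == 0


# Mode chosen for the first 11 backups with full-backup-interval: 5.
modes = ["full" if should_run_full_backup(n, 5) else "incremental" for n in range(11)]
print(modes)
```

Under these assumptions the very first backup (count 0) is requested as full, which matches the fact that a volume with no previous backup has nothing to be incremental against.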

#### Backupstore
1. Need to pass `parameters` through the gRPC function call chain.
Member

Just pass mode. The gRPC method call can be backward compatible, so passing a specific parameter is better for a clear interface protocol.

2. In the current implementation, if the Volume has a `lastBackup`, Longhorn always performs an incremental backup.
3. With this change, if `backup-mode: full` exists in the parameters, Longhorn
   - pretends the last backup does not exist, forcing a full backup.
   - overwrites each block on the backupstore even if it already exists.
4. Store `new upload data size` and `reupload data size` in the Backup Status.
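
The steps above could be modelled roughly as follows. This is a toy in-memory sketch, not the backupstore code. Several things are assumptions made for illustration: the `backup_blocks` name, the dict standing in for the backupstore, the use of SHA-256 as the block checksum, and the reading of `new upload data size` as bytes for blocks that were absent from the backupstore and `reupload data size` as bytes for blocks that already existed and were overwritten.

```python
import hashlib


def backup_blocks(blocks, store, mode="incremental"):
    """Upload volume blocks into `store` and return the two size counters.

    blocks: list of bytes objects, one per volume block.
    store:  dict mapping block checksum -> block data (stand-in backupstore).
    mode:   "incremental" skips blocks already present in the store;
            "full" uploads every block and overwrites existing ones.
    """
    new_upload_size = 0
    reupload_size = 0
    for data in blocks:
        checksum = hashlib.sha256(data).hexdigest()
        exists = checksum in store
        if exists and mode != "full":
            continue  # incremental: the stored copy is assumed to be intact
        store[checksum] = data  # full mode replaces a possibly corrupted copy
        if exists:
            reupload_size += len(data)
        else:
            new_upload_size += len(data)
    return new_upload_size, reupload_size


store = {}
volume = [b"a" * 4, b"b" * 4]
print(backup_blocks(volume, store))               # first backup: everything is new
corrupt_key = hashlib.sha256(volume[0]).hexdigest()
store[corrupt_key] = b"????"                      # simulate a corrupted block
print(backup_blocks(volume, store))               # incremental: nothing is re-sent
print(backup_blocks(volume, store, mode="full"))  # full: existing blocks overwritten
```

In this model only the full backup repairs the corrupted entry, and the two counters make the extra upload cost of doing so visible.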
Member

Does new upload data size mean delta uploaded data? Does reupload data size mean full backup?

If yes, suggest the following naming. WDYT?

  • incremental-backup/data-size
  • full-backup/data-size

Contributor Author
@ChanYiLin May 24, 2024

Not exactly.
One can keep doing full backups, each time with new data blocks (for example, a user who never uses incremental upload).
Those blocks would then not come from incremental-backup/data-size.

Member

Not quite sure. Can you explain again what is new upload data size and reupload data size?


#### Webhook
1. Validate the parameters to catch typos.
2. RecurringJob currently only accepts `full-backup-interval`
3. Backup currently only accepts `backup-mode`
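
A minimal sketch of the validation rules listed above, assuming the webhook rejects unknown keys and malformed values. `validate_parameters` and the `ALLOWED_PARAMETERS` table are hypothetical names, and accepting only non-negative integers for `full-backup-interval` is an assumption drawn from its documented default of 0.

```python
# Parameter keys each resource kind accepts, per the list above.
ALLOWED_PARAMETERS = {
    "RecurringJob": {"full-backup-interval"},
    "Backup": {"backup-mode"},
}
BACKUP_MODES = {"full", "incremental"}


def validate_parameters(kind, parameters):
    """Return a list of error strings; an empty list means the input is valid."""
    errors = []
    allowed = ALLOWED_PARAMETERS.get(kind, set())
    for key, value in parameters.items():
        if key not in allowed:
            # Catches typos as well as parameters meant for another kind.
            errors.append(f"{kind}: unsupported parameter {key!r}")
        elif key == "backup-mode" and value not in BACKUP_MODES:
            errors.append(f"backup-mode must be one of {sorted(BACKUP_MODES)}")
        elif key == "full-backup-interval" and not str(value).isdigit():
            errors.append("full-backup-interval must be a non-negative integer")
    return errors


print(validate_parameters("Backup", {"backup-mode": "full"}))
print(validate_parameters("RecurringJob", {"full-backup-intervl": "5"}))
```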

### Test plan
Member

Need to add a test case for DR.

Recurring backup with a full backup period, then see if DR volume can be activated with correct data.

Contributor Author

sure

Member

I still don't see the DR case here.


#### Manual Full Backup
1. Create a 4MB Volume and fill it with data.
2. Create a Backup of the Volume.
3. Intentionally replace the content of the first block (2MB) on the backupstore.
4. Restore the Volume; the restore should fail with error logs like the one below
```
[pvc-XXXXXX] time="XXXX" level=error msg="Backup data restore Error Found in Server[gzip: invalid checksum]"
```
5. Create a full backup with the parameter `backup-mode: full`
6. Restore the backup; this time the restore should succeed.
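
Steps 3 and 4 rely on the restore path noticing a damaged block. The effect can be reproduced locally with Python's `gzip` module, on the assumption (suggested by the `gzip: invalid checksum` log line) that the damaged block is gzip-compressed. The sketch only illustrates the failure mode and does not touch a real backupstore; `is_block_restorable` is a hypothetical helper.

```python
import gzip
import zlib


def is_block_restorable(compressed_block: bytes) -> bool:
    """Return True if the gzip-compressed block decompresses cleanly."""
    try:
        gzip.decompress(compressed_block)
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile (an OSError subclass) is raised on a CRC mismatch.
        return False


original = gzip.compress(b"volume data " * 1024)

# Flip one byte of the CRC32 stored in the 8-byte gzip trailer.
corrupted = bytearray(original)
corrupted[-8] ^= 0xFF
corrupted = bytes(corrupted)

print(is_block_restorable(original))
print(is_block_restorable(corrupted))
```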

#### Recurring Job Full Backup
1. Create a 4MB Volume and fill it with data.
2. Create a Backup of the Volume.
3. Intentionally replace the content of the first block (2MB) on the backupstore.
4. Restore the Volume; the restore should fail with error logs like the one below
```
[pvc-XXXXXX] time="XXXX" level=error msg="Backup data restore Error Found in Server[gzip: invalid checksum]"
```
5. Create a `backup` task type RecurringJob with the parameter `full-backup-interval: 1` and assign it to the volume
```
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: recurring-full-backup-per-min
  namespace: longhorn-system
spec:
  concurrency: 1
  cron: '* * * * *'
  groups: []
  labels: {}
  parameters:
    full-backup-interval: 1
  name: recurring-full-backup-per-min
  retain: 0
  task: backup
```
6. Wait for the recurring job to finish.
7. Restore the backup; this time the restore should succeed.


#### Concurrent Backup

1. Create a 4MB Volume and fill it with data.
2. Create 3 recurring jobs that run every minute: two for normal incremental backups and one for full backups.
3. These 3 recurring jobs should be triggered at the same time.
4. Wait for the 3 backups to finish.
5. Restore the last backup of the BackupVolume.
Member

Can we check each backup restore all work (w/ correct, but this seems unable to verify)

6. The content should be the same as the original Volume.


### Upgrade strategy

No need.

## Note [optional]

None.