Add the ability to automate and schedule backups #553

frouioui · 2024-05-13T19:59:57Z

Description

This Pull Request adds a new CRD called VitessBackupSchedule. Its main goal is to automate and schedule backups of Vitess by taking backups of the Vitess cluster at regular intervals, based on a given cron schedule and strategy. Like most other components of the vitess-operator, this new CRD is managed by the VitessCluster: the VitessCluster controller is responsible for the whole lifecycle (creation, update, deletion) of the VitessBackupSchedule object in the cluster. Inside the VitessCluster, several VitessBackupSchedules can be defined as a list, allowing for multiple concurrent backup schedules.

Among other things, the VitessBackupSchedule object is responsible for creating Kubernetes Jobs at the desired time, based on the user-defined schedule. It also keeps track of older jobs and deletes them if they are too old, according to user-defined parameters (successfulJobsHistoryLimit and failedJobsHistoryLimit). The jobs created by the VitessBackupSchedule object use the vtctld Docker image and execute a shell command that is generated from the user-defined strategies. The end user can define as many backup strategies per schedule as they want. Each strategy mimics what vtctldclient is able to do: the Backup and BackupShard commands are available, and a map of extra flags lets the user pass as many flags as they want to vtctldclient.
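To make the history-limit behavior concrete, here is a minimal sketch of how finished jobs could be pruned down to a limit such as successfulJobsHistoryLimit. It is not the controller's actual code: the `finishedJob` type and the `jobsToDelete` function are hypothetical names used only for illustration, and real code would work on Kubernetes Job objects instead.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// finishedJob is a hypothetical, simplified stand-in for a Kubernetes Job.
type finishedJob struct {
	name      string
	startTime time.Time
}

// jobsToDelete returns the names of the oldest jobs that exceed limit,
// so that only the `limit` most recent jobs are kept.
func jobsToDelete(jobs []finishedJob, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	if len(jobs) <= limit {
		return nil
	}
	// Sort a copy, oldest first, so the caller's slice is left untouched.
	sorted := append([]finishedJob(nil), jobs...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].startTime.Before(sorted[j].startTime)
	})
	var names []string
	for _, job := range sorted[:len(sorted)-limit] {
		names = append(names, job.name)
	}
	return names
}

func main() {
	base := time.Date(2024, time.May, 16, 22, 15, 0, 0, time.UTC)
	jobs := []finishedJob{
		{name: "job-c", startTime: base.Add(2 * time.Minute)},
		{name: "job-a", startTime: base},
		{name: "job-b", startTime: base.Add(time.Minute)},
	}
	// With a history limit of 2, only the oldest job is selected for deletion.
	fmt.Println(jobsToDelete(jobs, 2)) // prints [job-a]
}
```

The same function would be called once for successful jobs and once for failed jobs, each with its own limit.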

A new end-to-end test is added to our BuildKite pipeline as part of this Pull Request to test the proper behavior of this new CRD.

Related PRs

Documentation: Enhance the getting started guide: scheduled backups vitessio/website#1746
Vitessio/vitess: Update operator.yaml and add schedule backup example vitessio/vitess#15969

Demonstration

For this demonstration I have set up a Vitess cluster by following the steps in the getting started guide, up to the very last step, where we must apply the 306_down_shard_0.yaml file. My cluster is then composed of two keyspaces: customer with two shards, and commerce, which is unsharded. I then modify that yaml file to contain the new backup schedule, as seen in the snippet right below. We want to create two schedules, one for each keyspace. The keyspace customer will have two backup strategies: one for each shard.

apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: example
spec:
  backup:
    engine: xtrabackup
    locations:
      - volume:
          hostPath:
            path: /backup
            type: Directory
    schedules:
      - name: "every-minute-customer"
        schedule: "* * * * *"
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            memory: 1024Mi
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupShard
            keyspaceShard: "customer/-80"
          - name: BackupShard
            keyspaceShard: "customer/80-"
      - name: "every-minute-commerce"
        schedule: "* * * * *"
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            memory: 1024Mi
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupShard
            keyspaceShard: "commerce/-"
  images:

Once the cluster is stable, with all tablets serving and ready, I re-apply my yaml file with the backup configuration:

$ kubectl apply -f test/endtoend/operator/306_down_shard_0.yaml 
vitesscluster.planetscale.com/example configured

Immediately, I can check that the new VitessBackupSchedule objects have been created.

$ kubectl get VitessBackupSchedule 
NAME                                          AGE
example-vbsc-every-minute-commerce-ac6ff735   7s
example-vbsc-every-minute-customer-8aaaa771   7s

Now I want to check the pods where the jobs created by VitessBackupSchedule are running. After about 2 minutes, we can see four pods, two for each schedule. The pods are marked as Completed as they finished their job.

$ kubectl get pods
NAME                                                           READY   STATUS             RESTARTS        AGE
...
example-vbsc-every-minute-commerce-ac6ff735-1715897700-nkfzx   0/1     Completed          0              79s
example-vbsc-every-minute-commerce-ac6ff735-1715897760-qr4hp   0/1     Completed          0              19s
example-vbsc-every-minute-customer-8aaaa771-1715897700-rbsmd   0/1     Completed          0              79s
example-vbsc-every-minute-customer-8aaaa771-1715897760-kzn8t   0/1     Completed          0              19s
...
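The numeric field near the end of each pod name looks like a Unix timestamp of the scheduled run. This is an observation from the listing above, not something the PR states, so treat it as an assumption. Under that assumption, a small hypothetical helper (`scheduledTime`) can recover the scheduled minute from a pod name:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// scheduledTime reads the second-to-last dash-separated field of a pod name
// and interprets it as Unix seconds. That interpretation is an assumption
// based on the pod names in the listing, not a documented naming contract.
func scheduledTime(podName string) (time.Time, error) {
	parts := strings.Split(podName, "-")
	if len(parts) < 2 {
		return time.Time{}, fmt.Errorf("unexpected pod name %q", podName)
	}
	secs, err := strconv.ParseInt(parts[len(parts)-2], 10, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("no timestamp in pod name %q: %w", podName, err)
	}
	return time.Unix(secs, 0).UTC(), nil
}

func main() {
	t, err := scheduledTime("example-vbsc-every-minute-commerce-ac6ff735-1715897700-nkfzx")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Format(time.RFC3339)) // prints 2024-05-16T22:15:00Z
}
```

The decoded minute, 22:15 UTC, lines up with the 2024-05-16.2215xx backup directory names.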

Now let's check our backup:

$ ls -l vtdataroot/backup/example/commerce/- vtdataroot/backup/example/customer/80- vtdataroot/backup/example/customer/-80 

vtdataroot/backup/example/commerce/-:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221502.zone1-0790125915
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:16 2024-05-16.221602.zone1-0790125915

vtdataroot/backup/example/customer/-80:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221502.zone1-2289928654
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:16 2024-05-16.221601.zone1-2289928654

vtdataroot/backup/example/customer/80-:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221511.zone1-4277914223
drwxr-xr-x  10 florentpoinsard  staff  320 May 16 16:16 2024-05-16.221609.zone1-2298643297

$ kubectl get vtb --no-headers
example-commerce-x-x-20240516-221502-2f185d5b-1854be28    2m7s
example-commerce-x-x-20240516-221602-2f185d5b-0a248174    67s
example-customer-80-x-20240516-221511-fefbca6f-8ede9c7d   2m7s
example-customer-80-x-20240516-221609-89028361-d9d1c1e4   67s
example-customer-x-80-20240516-221502-887d89ce-2fc618f4   2m7s
example-customer-x-80-20240516-221601-887d89ce-5b5b0acb   66s

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui · 2024-05-24T20:34:29Z

In commit bc74ab4, I have applied one of the most important suggestions discussed above, which is to remove the BackupTablet strategy in favor of BackupKeyspace and BackupCluster. The strategies can be used as follows:

# BackupKeyspace
        strategies:
          - name: BackupKeyspace
            cluster: "example"
            keyspace: "customer"

# BackupCluster
        strategies:
          - name: BackupCluster
            cluster: "example"

Meanwhile, the BackupShard strategy does not change. When run, we can see the following command-line arguments in the job's pod, which get executed upon creation of the container:

# BackupKeyspace
Args:
      /bin/sh
      -c
      /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-

# BackupCluster
Args:
      /bin/sh
      -c
      /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard commerce/- && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-
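The arguments above are one vtctldclient call per shard, chained with `&&`. A minimal sketch of how such a command string could be assembled follows. `buildBackupCommand` is a hypothetical name, and the way extra flags are rendered (`--name=value`, placed right after `BackupShard`) is an assumption; the operator's real helper may format them differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildBackupCommand chains one "vtctldclient BackupShard" call per
// keyspace/shard with "&&", so a failed backup stops the remaining ones.
// The extra-flag rendering is an assumption made for this sketch.
func buildBackupCommand(server string, extraFlags map[string]string, keyspaceShards []string) string {
	// Sort flag names so the generated command is deterministic.
	names := make([]string, 0, len(extraFlags))
	for name := range extraFlags {
		names = append(names, name)
	}
	sort.Strings(names)

	var cmd strings.Builder
	for i, ks := range keyspaceShards {
		if i > 0 {
			cmd.WriteString(" && ")
		}
		fmt.Fprintf(&cmd, "/vt/bin/vtctldclient --server=%s BackupShard", server)
		for _, name := range names {
			fmt.Fprintf(&cmd, " --%s=%s", name, extraFlags[name])
		}
		fmt.Fprintf(&cmd, " %s", ks)
	}
	return cmd.String()
}

func main() {
	// Same server and shards as the BackupKeyspace example above, no extra flags.
	fmt.Println(buildBackupCommand(
		"example-vtctld-625ee430:15999",
		nil,
		[]string{"customer/-80", "customer/80-"},
	))
}
```

Because the calls are joined with `&&`, the shards are backed up one after the other inside a single container.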

cc @maxenglander @mattlord

maxenglander · 2024-05-28T19:52:45Z

pkg/apis/planetscale/v2/vitessbackupschedule_types.go

+
+	// Cluster defines on which cluster you want to take the backup.
+	// This field is mandatory regardless of the chosen strategy.
+	Cluster string `json:"cluster"`


i'm not sure i follow why this is necessary. my mental model is that a user defines []VitessBackupScheduleTemplate on the ClusterBackupSpec, and so implicitly each VitessBackupScheduleStrategy will be associated with the cluster where ClusterBackupSpec is defined.

That's a good point @maxenglander, it is pretty useless. I ended up removing that field from VitessBackupScheduleStrategy and adding it to VitessBackupScheduleSpec. The VitessCluster controller fills in that new field when it creates a new VitessBackupSchedule object. That way, VitessBackupSchedule is still able to select existing components by their cluster name, which avoids fetching wrong data when multiple VitessClusters are running in our K8S cluster.

See b30aa09 for the change.

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui · 2024-05-30T18:29:12Z

another thought, might be nice to give users a way to assign annotations, and one or more affinity selection options to the backup runner pods. that way they can influence things like scheduling and eviction.

for example, users might not want backup runner pods running on the same nodes as vttablet pods. and they might not want the backup runner pods to get evicted by an unrelated pod after they've been running for a long time.

In e6946fb I have added affinity and annotations in the VitessBackupScheduleTemplate, allowing the user to configure the affinity and annotations they want for their pods that take backups.

pkg/apis/planetscale/v2/labels.go

maxenglander · 2024-05-30T18:45:20Z

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

+			return err
+		}
+		if jobStartTime.Add(time.Minute * time.Duration(timeout)).Before(time.Now()) {
+			if err := r.client.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); (err) != nil {


seems like a good thing to have a metric for

fixed via 46b6967 + 5809cbd
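The timeout condition in the snippet above can be isolated into a small pure function, which makes it easy to test and to pair with a counter for timed-out jobs. This is only a sketch: `jobTimedOut` and `demoStart` are hypothetical names, and the real controller reads the start time and timeout from the Job and the schedule spec.

```go
package main

import (
	"fmt"
	"time"
)

// demoStart is an arbitrary start time used only by the example below.
var demoStart = time.Date(2024, time.May, 16, 22, 15, 0, 0, time.UTC)

// jobTimedOut reports whether a job started at start has exceeded
// timeoutMinutes as of now. It mirrors the condition in the diff:
// start + timeout is strictly before the current time.
func jobTimedOut(start time.Time, timeoutMinutes int32, now time.Time) bool {
	deadline := start.Add(time.Minute * time.Duration(timeoutMinutes))
	return deadline.Before(now)
}

func main() {
	fmt.Println(jobTimedOut(demoStart, 10, demoStart.Add(9*time.Minute)))  // prints false
	fmt.Println(jobTimedOut(demoStart, 10, demoStart.Add(11*time.Minute))) // prints true
}
```

Passing `now` as a parameter instead of calling `time.Now()` inside the function keeps the check deterministic in tests.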

maxenglander · 2024-05-30T18:47:52Z

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

+	return job, nil
+}
+
+func (r *ReconcileVitessBackupsSchedule) createJobPod(ctx context.Context, vbsc *planetscalev2.VitessBackupSchedule, name string) (pod corev1.PodSpec, err error) {


might be worth adding a note about that in release notes. i expect it will be a common issue people run in to.

maxenglander · 2024-05-30T18:51:34Z

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

+					if shardIndex > 0 || ksIndex > 0 {
+						cmd.WriteString(" && ")
+					}
+					createVtctldClientCommand(&cmd, vtctldclientServerArg, strategy.ExtraFlags, ks.name, shard)


am i reading this right that it will be taking a backup of each keyspace and shard in sequence? that doesn't seem ideal to me because if each shard takes an hour to backup, and there are 32 shards, then the backup of the first shard and last shard will be more than a day apart.

i think it would be better if there were at least the option of BackupCluster and BackupKeyspace to backup all keyspaces and shards in parallel.

might be better to limit this PR to only support BackupShard for now, and add support for the other options after more consideration into how to implement BackupKeyspace and BackupCluster.

Let's do that, remove those two strategies as part of this PR and I will work on a subsequent PR to add them back with a better approach. This PR is getting lengthy already.

Fixed via 70ba063

IMO BackupAllShardsInKeyspace and BackupAllShardsInCluster are better names. It may seem nitty, but I think it's important as it reflects what it actually is: independent backups of the shards. i.e. it is NOT a single consistent backup of the keyspace or cluster at any physical or logical point in time.

I ended up removing Keyspace and Cluster strategies in this PR as it will require a bigger refactoring. I am keeping that in mind for when we add them though.

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

maxenglander

one last thought, lgtm overall

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

mattlord

Nice work on this, @frouioui ! ❤️ I only had a few nits/comments that you can address as you feel is best.

.buildkite/pipeline.yml

mattlord · 2024-06-03T15:20:39Z

docs/release-notes/2_13_0_summary.md

+take into account when using this feature:
+
+- If you are using the `xtrabackup` engine, your vttablet pods will need more memory, think about provisioning more memory for it.
+- If you are using the `builtin` engine, you will lose a replica during the backup, think about adding a new tablet.


I think there's a minimum healthy tablet setting? If so, worth mentioning that here IMO.

There is not

test/endtoend/backup_schedule_test.sh

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

mattlord · 2024-06-03T16:09:42Z

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

+		ks := keyspace{
+			name: item.Spec.Name,
+		}
+		for shardName := range item.Status.Shards {
+			ks.shards = append(ks.shards, shardName)
+		}
+		if len(ks.shards) > 0 {
+			result = append(result, ks)
+		}


Curious why we don't do this instead:

for shardName := range item.Status.Shards {
	ks.shards = append(result, &keyspace{
		name:   item.Spec.Name,
		shards: shardName,
	})
}

The other allocations/copying seem unnecessary at first glance. When combined with the single shot precise allocation it should be more efficient.

I am not sure I understand what you are suggesting. We still want to create one keyspace object per item in ksList.Items, and for all the shards in this item we want to append to keyspace.shards.
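One way to keep one keyspace object per item while still allocating precisely is to size each shards slice up front. The sketch below assumes that reading of the discussion. `keyspaceItem` and `collectKeyspaces` are hypothetical stand-ins for the entries of ksList.Items and the surrounding function, and the sorting step is an addition made here to get a stable shard order.

```go
package main

import (
	"fmt"
	"sort"
)

// keyspace has the same shape as in the snippet above: a name and its shards.
type keyspace struct {
	name   string
	shards []string
}

// keyspaceItem is a hypothetical stand-in for one entry of ksList.Items.
type keyspaceItem struct {
	name   string
	shards map[string]struct{}
}

// collectKeyspaces builds one keyspace per item that has at least one shard,
// allocating each shards slice once with its final capacity.
func collectKeyspaces(items []keyspaceItem) []keyspace {
	result := make([]keyspace, 0, len(items))
	for _, item := range items {
		if len(item.shards) == 0 {
			continue
		}
		ks := keyspace{
			name:   item.name,
			shards: make([]string, 0, len(item.shards)),
		}
		for shardName := range item.shards {
			ks.shards = append(ks.shards, shardName)
		}
		// Map iteration order is random, so sort for a stable result.
		sort.Strings(ks.shards)
		result = append(result, ks)
	}
	return result
}

func main() {
	items := []keyspaceItem{
		{name: "commerce", shards: map[string]struct{}{"-": {}}},
		{name: "customer", shards: map[string]struct{}{"80-": {}, "-80": {}}},
	}
	fmt.Println(collectKeyspaces(items)) // prints [{commerce [-]} {customer [-80 80-]}]
}
```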

pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui added 20 commits May 8, 2024 10:30

Add VitessBackupSchedule CRD

7666e88

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Added the VitessBackupSchedule controller for the standalone CRD

51f9653

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Add VitessBackupSchedule to VitessCluster and clean up the CRD

eec5c02

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Clean up VitessCluster controller for VitessBackupSchedule

4586474

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Use proper Docker Image in VitessBackupSchedule and re-add fields

fd30010

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Add backup strategies and fix VitessCluster reconcile loop for vbsc

b6c0150

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Allow for deletion of VitessBackupSchedule

bfe82b0

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Enable backup for strategy 'backupShard'

2e4fd65

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Enable backup for strategy 'backupTablet'

33c23a0

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Refactor VitessBackupSchedule reconciling loop

49166e3

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

List Jobs using a label filter

b961d35

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Support multiple schedules at the same time

9216b59

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Remove non-required logging

4177d3d

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Allow for multiple backup strategies per schedule

def8ac3

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Remove non-required logging

9b1ce77

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Merge remote-tracking branch 'origin/main' into scheduled-backups

dc2aa18

Self-review

062d7a0

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Re-generate CRDs and operator.yaml

d8cec2a

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Add timeout for stale jobs

3347bed

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Add E2E test for scheduled backups

cd77eb3

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui changed the title ~~Add VitessBackupSchedule~~ VitessBackupSchedule add the ability to automate backups May 16, 2024

frouioui changed the title ~~VitessBackupSchedule add the ability to automate backups~~ Add the ability to automate and schedule backups May 16, 2024

frouioui added 3 commits May 16, 2024 16:23

Modify backup strategy to use a flag map

a129b4c

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Revert debug code

614f31d

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Fix the backup schedule test to expect 2 every-five-minute pods

95ff978

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

This was referenced May 17, 2024

Enhance the getting started guide: scheduled backups vitessio/website#1746

Merged

Update operator.yaml and add schedule backup example vitessio/vitess#15969

Merged

Add default to podJobTimeout

ed10321

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui marked this pull request as ready for review May 17, 2024 17:01

frouioui requested a review from mattlord May 17, 2024 17:03

frouioui added 3 commits May 24, 2024 10:41

Change CRD example to be every day instead of every minute

8cbfcad

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Revert errexit change in test script

6e20613

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Add BackupCluster and BackupKeyspace

bc74ab4

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

maxenglander reviewed May 28, 2024

frouioui added 4 commits May 28, 2024 15:06

Remove Cluster field in Strategy

b30aa09

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Add release notes

4dc9ce0

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Fix 101 schedule test

9f7cd44

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Use env variable for the golang version

54c9512

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui mentioned this pull request May 30, 2024

Release of v20.0.0-RC1 vitessio/vitess#16010

Open

37 tasks

Add custom annotations and affinity to backup pods

e6946fb

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui requested review from mattlord and maxenglander May 30, 2024 18:29

maxenglander reviewed May 30, 2024

frouioui added 4 commits May 30, 2024 14:24

Add more information in the release notes about extraFlags

9739a94

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Fix metrics and add new metrics for timed out jobs

46b6967

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Revert unwanted change

5809cbd

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Revert BackupKeyspace and BackupCluster

70ba063

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui requested a review from maxenglander May 30, 2024 21:24

frouioui added 2 commits May 30, 2024 15:25

Merge remote-tracking branch 'origin/main' into scheduled-backups

6dbd28a

Revert panic issue

f74592b

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

maxenglander approved these changes Jun 3, 2024

Removed Replace concurrency policy

36d1b40

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

mattlord approved these changes Jun 3, 2024

frouioui added 3 commits June 3, 2024 12:04

Fix review suggestions

54ec222

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Fix fmt

1458329

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

Fix out of bound

290b3a1

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

frouioui merged commit f754509 into main Jun 3, 2024
10 checks passed

frouioui deleted the scheduled-backups branch June 3, 2024 21:06
