[Feature] Better garbage collection of snapshots #621

Open
3 of 4 tasks
shreyas-s-rao opened this issue May 3, 2023 · 6 comments
Assignees
Labels
area/backup Backup related area/ops-productivity Operator productivity related (how to improve operations) area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) priority/1 Priority (lower number equals higher priority)

Comments

@shreyas-s-rao
Collaborator

shreyas-s-rao commented May 3, 2023

Feature (What you would like to be added):
Better garbage collection of snapshots.

Motivation (Why is this needed?):
The current Exponential garbage collection policy is hard-coded to a fixed schedule of full and delta snapshots. It does not work well when full snapshots are scheduled differently from the expected "once per hour" or "once per day". Additionally, delta snapshots are retained for only the past 24 hours. This should be configurable so that delta snapshots can be kept for a longer period, long enough for operators to debug possible issues or bugs in productive environments.

@shreyas-s-rao shreyas-s-rao added kind/enhancement Enhancement, improvement, extension area/ops-productivity Operator productivity related (how to improve operations) area/robustness Robustness, reliability, resilience related area/backup Backup related priority/1 Priority (lower number equals higher priority) labels May 3, 2023
@shreyas-s-rao
Collaborator Author

/assign @seshachalam-yv

@seshachalam-yv
Contributor

seshachalam-yv commented Jun 7, 2023

@gardener/etcd-druid-maintainers,

I would like to propose an enhancement to the current Exponential policy for snapshot retention. Below, I outline the current implementation and my proposed changes.

Exponential Policy

Current Implementation:

How Does Garbage Collection Work?

  • The most recent full snapshot and corresponding delta snapshots are retained.
  • All other delta snapshots are deleted.
  • All full snapshots from the current hour will be retained.
  • The most recent full snapshot from each hour of the last 24 hours is retained.
  • The most recent full snapshot from each day of the last 7 days is retained.
  • The most recent full snapshot from each week of the last 4 weeks is retained.

Problem: The current implementation lacks flexibility as it does not allow for configuration of the retention period for delta or full snapshots. Snapshots are only retained for 4 weeks.

Proposed Improvement (with Backward Compatibility):

The proposal is to enhance flexibility by introducing two new flags that adjust the snapshot retention period:

  • deltaSnapshotRetentionPeriod: This flag sets the retention duration for the most recent delta snapshots, with a default value of 24 hours.
  • fullSnapshotRetentionPeriod: This flag sets the retention duration for the most recent full snapshots, with a default value of 31 days.

These default values have been chosen to mirror the existing garbage collection values as closely as possible, ensuring backward compatibility and preventing disruption to existing configurations.
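As a rough illustration of the two proposed settings and their defaults, here is a minimal Python sketch. The class name `RetentionConfig`, the snake_case field names, and the positivity check are hypothetical and only mirror the flags described above; they are not the actual implementation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RetentionConfig:
    """Hypothetical container for the two proposed retention flags."""

    # Mirrors deltaSnapshotRetentionPeriod (proposed default: 24 hours).
    delta_snapshot_retention_period: timedelta = timedelta(hours=24)
    # Mirrors fullSnapshotRetentionPeriod (proposed default: 31 days).
    full_snapshot_retention_period: timedelta = timedelta(days=31)

    def __post_init__(self):
        # Assumption: a zero or negative retention period is a misconfiguration.
        for name in ("delta_snapshot_retention_period",
                     "full_snapshot_retention_period"):
            if getattr(self, name) <= timedelta(0):
                raise ValueError(f"{name} must be positive")


defaults = RetentionConfig()
# The 31-day default covers the existing "last 4 weeks" tier (28 days).
print(defaults.full_snapshot_retention_period >= timedelta(weeks=4))
```

An operator who wants delta snapshots for longer debugging windows would, under this sketch, construct `RetentionConfig(delta_snapshot_retention_period=timedelta(days=14))`.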

The garbage collection process, given these retention periods, will be managed as follows:

  • The latest full snapshot and its corresponding delta snapshots will always be retained, regardless of the deltaSnapshotRetentionPeriod and fullSnapshotRetentionPeriod configurations. This is crucial to safeguard potential data restoration.
  • All delta snapshots within the deltaSnapshotRetentionPeriod will be retained.
  • Full snapshots within the fullSnapshotRetentionPeriod will be subject to the following rules for retention:
    • All full snapshots from the current hour will be retained.
    • The most recent full snapshot from each hour of the past 24 hours will be retained.
    • The most recent full snapshot from each day of the past 7 days will be retained.
    • The most recent full snapshot from each week of the past 4 weeks will be retained.
    • As a new feature, we will now retain the most recent full snapshot from each month of the preceding months.

Note: Any full snapshot older than the fullSnapshotRetentionPeriod will be deleted immediately, even if it is the most recent snapshot from a specific hour, day, week, or month. The only exception is the latest full snapshot, which is always retained.

graph TB
    Start --> A[Is it the latest full snapshot or one of its delta snapshots?]
    A -->|Yes| B[Retain]
    A -->|No| C[Is it a full snapshot?]
    C -->|Yes| D[Is the full snapshot within fullSnapshotRetentionPeriod?]
    D -->|Yes| E1[Is it from the current hour?]
    E1 -->|Yes| B
    E1 -->|No| E2[Is it the most recent from the past 24 hours?]
    E2 -->|Yes| B
    E2 -->|No| E3[Is it the most recent from the past 7 days?]
    E3 -->|Yes| B
    E3 -->|No| E4[Is it the most recent from the past 4 weeks?]
    E4 -->|Yes| B
    E4 -->|No| E5[Is it the most recent from the preceding months?]
    E5 -->|Yes| B
    E5 -->|No| F[Delete]
    D -->|No| F
    C -->|No| G[Is the delta snapshot within deltaSnapshotRetentionPeriod?]
    G -->|Yes| B
    G -->|No| F
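
The decision flow above can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual garbage collector: the `Snapshot` type, the function names, and the exact tier boundaries (clock hour, calendar day, ISO week, calendar month) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    kind: str            # "full" or "delta"
    taken_at: datetime


def bucket_key(ts, now):
    """Map a full snapshot to the tier bucket it competes in (assumed boundaries)."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    if hour == now.replace(minute=0, second=0, microsecond=0):
        return ("current-hour", ts)          # unique key: every snapshot is kept
    age = now - ts
    if age <= timedelta(hours=24):
        return ("hour", hour)
    if age <= timedelta(days=7):
        return ("day", ts.date())
    if age <= timedelta(weeks=4):
        iso = ts.isocalendar()
        return ("week", iso[0], iso[1])
    return ("month", ts.year, ts.month)


def select_retained(snapshots, now,
                    full_retention=timedelta(days=31),
                    delta_retention=timedelta(hours=24)):
    """Return the set of snapshots that the flowchart would retain."""
    fulls = sorted((s for s in snapshots if s.kind == "full"),
                   key=lambda s: s.taken_at, reverse=True)
    latest_full = fulls[0] if fulls else None
    keep, seen = set(), set()
    for snap in fulls:                       # newest first
        key = bucket_key(snap.taken_at, now)
        if snap is latest_full:
            keep.add(snap)                   # always retained
            seen.add(key)
            continue
        if now - snap.taken_at > full_retention:
            continue                         # outside the retention period
        if key not in seen:                  # most recent snapshot of its bucket
            seen.add(key)
            keep.add(snap)
    for snap in snapshots:
        if snap.kind != "delta":
            continue
        after_latest_full = (latest_full is not None
                             and snap.taken_at > latest_full.taken_at)
        if after_latest_full or now - snap.taken_at <= delta_retention:
            keep.add(snap)
    return keep
```

Everything not in the returned set would be deleted. Iterating newest first means the first snapshot seen in each bucket is automatically the most recent one for that hour, day, week, or month.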

@ashwani2k
Collaborator

ashwani2k commented Jun 8, 2023

Thanks for the neat write-up @seshachalam-yv.

These default values have been chosen to mirror the existing garbage collection values as closely as possible, ensuring backward compatibility and preventing disruption to existing configurations.

Shall we also capture what a good value for these settings would be, e.g. keeping delta snapshots for 14 days?
Once the values are set, is there any constraint on changing them? What happens when a lower value is set, say deltaRetentionPeriod is changed from 14 days to 7 days? Explicitly mentioning that might be good for reference.

The most recent full snapshot from each hour of the past 24 hours will be retained.

I've always found this difficult to comprehend. Could you also attach a sample of how this would look over a week's period? Just placeholders in the folder structure as defined for the blob store would do.

As a new feature, we will now retain the most recent full snapshot from each month of the preceding months.

What's the use case for this? Since PITR is mostly not possible, having more backups here doesn't help much, given the auto-restoration from the last known good delta/full snapshot.

Nitpick
I believe you wanted "for" instead of "or" in the sentence below.
[image]

@unmarshall
Contributor

unmarshall commented Jun 12, 2023

@seshachalam-yv the current retention algorithm is quite complex. I would really like to understand the reasoning (maybe there is sound reasoning) that led us to make this simple thing so complicated. Could you provide the reasoning for why the following rules are required? This will add context and background on why things are the way they are:

All full snapshots from the current hour will be retained.
The most recent full snapshot from each hour of the past 24 hours will be retained.
The most recent full snapshot from each day of the past 7 days will be retained.
The most recent full snapshot from each week of the past 4 weeks will be retained.
As a new feature, we will now retain the most recent full snapshot from each month of the preceding months.

This only increases code complexity. Since we are revisiting this topic, let's take the opportunity to understand the use cases that require such complexity. If there are none (which means it is just technical-complexity debt), let's simplify this so that understanding, maintaining, and consuming it becomes easy.

@seshachalam-yv
Contributor

@ashwani2k

I've always found this difficult to comprehend, can you may be also attach a sample of how this will look for a week's period. Just placeholders in the folder structure as defined for blob store.

I understand the concept might initially seem difficult to grasp. For illustration, let's consider a time period from 2023-06-11 11:00 to 2023-06-11 14:00, with full snapshots taken every 15 minutes. After three hours, we would have 12 snapshots, as shown in the Gantt chart below.

gantt
    dateFormat  YYYY-MM-DD HH:mm
    axisFormat %Y-%m-%d-%H:%M
    title Garbage collection (GC)  Exponential Policy
    section Full Snapshots
    
    full 1 :crit,   full1, 2023-06-11 11:00, 15m
    full 2 :crit, full2, after full1, 15m
    full 3 :crit, full3, after full2, 15m
    full 4 :active, full4, after full3, 15m
    [Retain] Most recent full snapshot from (11-12) - full 4: retain,2023-06-11 11:00, 1h
    full 5 :crit,   full5, after full4, 15m
    full 6 :crit, full6, after full5, 15m
    full 7 :crit,   full7, after full6, 15m
    full 8 :active, full8, after full7, 15m
   [Retain] Most recent full snapshot from (12-13) - full 8: retain, after full4, 1h
    full 9 :active,   full9, after full8, 15m
    full 10 :active, full10, after full9, 15m
    full 11 :active,   full11, after full10, 15m
    full 12 :active, full12, after full11, 15m
   [Retain] All full snapshots from current hour (13-14): retain, after full8, 1h
    fullSnapshotRetentionPeriod is 31 days: fullSnapshotRetentionPeriod, 2023-06-11 11:00, 3h

In line with our policy, the latest full snapshot, full 12 in this case, is always retained, according to our first rule:

The latest full snapshot and its corresponding delta snapshots will always be retained, regardless of the deltaSnapshotRetentionPeriod and fullSnapshotRetentionPeriod configurations. This is crucial to safeguard potential data restoration.

Snapshots full 11, full 10, and full 9 are retained because they are within the fullSnapshotRetentionPeriod (the default value is 31 days) and belong to the current hour, as per the first of the full snapshot GC rules:

All full snapshots from the current hour will be retained.

Snapshot full 8 is retained, while full 7, full 6, and full 5 are garbage collected (deleted from the store), even though they are within the fullSnapshotRetentionPeriod. Only full 8 is retained, according to the following GC rule:

The most recent full snapshot from each hour of the past 24 hours will be retained.

Thus, full 8 is the most recent full snapshot for the hour between 2023-06-11 12:00 and 2023-06-11 13:00. This rule is applied similarly for the hour 2023-06-11 11:00 to 2023-06-11 12:00, with full 4 being retained and full 3, full 2, and full 1 being garbage collected as they are not the most recent snapshots of that hour.

To clarify our terminology:

Latest Snapshot: In our example, full 12 is the latest full snapshot. The delta snapshots taken after the latest full snapshot are considered the latest delta snapshots.

Most Recent Snapshot: The most recent full snapshot from each hour of the past 24 hours is retained. In our example, full 8 is the most recent snapshot from the 12:00-13:00 hour, and full 4 from the 11:00-12:00 hour.

Current Hour: All full snapshots from the current hour are retained. Here, the current hour is 13:00-14:00, so full 9, full 10, full 11, and full 12 are retained.

This example covers a span of 3 hours with full snapshots taken every 15 minutes for simplicity and clarity. However, the same logic applies to larger spans of time such as days, weeks, or months according to the GC rules. The visualization provided in the Gantt chart should help illustrate how these rules work in practice.
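
The three-hour example can be reproduced with a short Python sketch that applies only the hourly rules. It assumes that "full N" is taken at the start of its 15-minute slot (so full 1 at 11:00 and full 12 at 13:45) and that GC is evaluated at 13:50, inside the 13:00-14:00 hour; the function name `hourly_retention` is hypothetical.

```python
from datetime import datetime, timedelta

start = datetime(2023, 6, 11, 11, 0)
# Assumption: "full N" is taken at the start of its 15-minute slot.
snapshots = {f"full {i + 1}": start + timedelta(minutes=15 * i) for i in range(12)}
now = datetime(2023, 6, 11, 13, 50)  # assumed evaluation time in the current hour


def hourly_retention(snapshots, now):
    """Keep every snapshot of the current hour and the newest one of each earlier hour."""
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    newest_per_hour = {}   # hour start -> name of the newest snapshot in that hour
    retained = []
    for name, taken_at in snapshots.items():
        hour = taken_at.replace(minute=0, second=0, microsecond=0)
        if hour == current_hour:
            retained.append(name)
        elif (hour not in newest_per_hour
              or taken_at > snapshots[newest_per_hour[hour]]):
            newest_per_hour[hour] = name
    # Sort the result chronologically for readability.
    return sorted(retained + list(newest_per_hour.values()), key=snapshots.get)


print(hourly_retention(snapshots, now))
# ['full 4', 'full 8', 'full 9', 'full 10', 'full 11', 'full 12']
```

The output matches the chart: full 4 and full 8 survive as the most recent snapshots of their hours, and full 9 through full 12 survive because they belong to the current hour.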

@seshachalam-yv
Contributor

What happens once lower values are set lets say change deltaRetentionPeriod to 7 days from 14 days. May be explicitly mentioning that might be good for reference.

Let's consider the same time period from 2023-06-11 11:00 to 2023-06-11 14:00 and adjust the fullSnapshotRetentionPeriod to 2 hours. In this scenario, the GC rules are still followed in the same order, but the snapshots eligible for retention would change due to the shorter fullSnapshotRetentionPeriod.

gantt
    dateFormat  YYYY-MM-DD HH:mm
    axisFormat %Y-%m-%d-%H:%M
    title Garbage collection (GC)  Exponential Policy (fullSnapshotRetentionPeriod = 2h)
    section Full Snapshots
    
    full 1 :crit,   full1, 2023-06-11 11:00, 15m
    full 2 :crit, full2, after full1, 15m
    full 3 :crit, full3, after full2, 15m
    full 4 :crit, full4, after full3, 15m
    GC [deleted] all snapshots older than fullSnapshotRetentionPeriod : deleted, 2023-06-11 11:00, 1h
    full 5 :crit,   full5, after full4, 15m
    full 6 :crit, full6, after full5, 15m
    full 7 :crit,   full7, after full6, 15m
    full 8 :active, full8, after full7, 15m
    [Retain] Most recent full snapshot from (12-13) - full 8: retain, after full4, 1h
    full 9 :active,   full9, after full8, 15m
    full 10 :active, full10, after full9, 15m
    full 11 :active,   full11, after full10, 15m
    full 12 :active, full12, after full11, 15m
    [Retain] All full snapshots from current hour (13-14): retain, after full8, 1h
    fullSnapshotRetentionPeriod is now 2 hours: fullSnapshotRetentionPeriod, after full4, 2h

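In this case, snapshots full 12, full 11, full 10, and full 9 are still retained: they fall within the fullSnapshotRetentionPeriod of 2 hours and belong to the current hour. Snapshot full 8 is also retained as the most recent full snapshot of the 12:00-13:00 hour. Full 12 is additionally protected as the latest full snapshot, according to our GC rules: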

The latest full snapshot and its corresponding delta snapshots will always be retained, regardless of the deltaSnapshotRetentionPeriod and fullSnapshotRetentionPeriod configurations.

However, snapshots full 1, full 2, full 3, and full 4 are now all garbage collected, as they fall outside the fullSnapshotRetentionPeriod of 2 hours.
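
The effect of shrinking the retention period can be sketched as a cutoff applied to the snapshots that survive the hourly rules. The timestamps, the 13:50 evaluation time, and the function name `apply_retention_period` are assumptions for illustration. The sketch also assumes that a lowered value simply takes effect on the next GC run, which is not confirmed in this thread.

```python
from datetime import datetime, timedelta

now = datetime(2023, 6, 11, 13, 50)  # assumed evaluation time
# A few snapshots that survive the hourly rules, with assumed timestamps.
candidates = {
    "full 4": datetime(2023, 6, 11, 11, 45),
    "full 8": datetime(2023, 6, 11, 12, 45),
    "full 9": datetime(2023, 6, 11, 13, 0),
    "full 12": datetime(2023, 6, 11, 13, 45),
}


def apply_retention_period(candidates, now, retention):
    """Drop candidates older than the retention period, except the latest one."""
    latest = max(candidates, key=candidates.get)
    return [name for name, taken_at in candidates.items()
            if name == latest or now - taken_at <= retention]


print(apply_retention_period(candidates, now, timedelta(days=31)))
# ['full 4', 'full 8', 'full 9', 'full 12']
print(apply_retention_period(candidates, now, timedelta(hours=2)))
# ['full 8', 'full 9', 'full 12']
```

With the 2-hour period, full 4 (taken 2 hours and 5 minutes before the evaluation time) falls outside the window and is dropped, while the latest snapshot is kept no matter how short the period is.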

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 19, 2024