Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pghoard: ignore delta backup failures counter in some cases #621

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

egor-voynov-aiven
Copy link

@egor-voynov-aiven egor-voynov-aiven commented Apr 9, 2024

Pghoard will try making backup in this cases, еven if the retries are over:

  1. Backup was requested by operator (called related http endpoint)
  2. More than backup_interval have passed since the last unsuccessful attempt

[BF-2390]

Pghoard can stuck in state, when it doesn't make backup after several failures. It just writes in log Giving up backup after exceeding max retries and only restart can help.

Cases:
1. Backup was requested by operator: `AVN-PROD service request-backup`
2. More than `backup_interval` have passed since the last unsuccessful attempt

[BF-2390]
@egor-voynov-aiven egor-voynov-aiven changed the title pghoard: convert 'metadata["backup-reason"]' value to enum pghoard: ignore delta backup failures counter in some cases Apr 10, 2024
@egor-voynov-aiven egor-voynov-aiven marked this pull request as ready for review April 10, 2024 10:26
On CI environment threads don't have enough time for initialization.
since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
if metadata["backup-reason"] == BackupReason.requested:
self.log.info("Re-trying delta basebackup. Backup was requested")
elif backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to discuss whether we need to consider backoff retry internal for this two new cases. Just keeping retrying without waiting a bit seems less meaningful to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants