Improve storage fail detection by putting the object #1549

Merged: 10 commits, Sep 26, 2023

Contributor

ne2pit commented Sep 5, 2023

Sometimes a storage may be alive but respond very slowly or switch to read-only mode. The current fail detector uses the ListObject method to check its liveness. Such a check is unreliable in these cases.

This PR changes the check method to PutObject. The data is pseudo-randomly generated. The object size and the timeout are configurable; the defaults are 1 MiB and 30 s. The object always has the same predefined name, so sequential checks do not waste additional storage space.
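
For readers who want a concrete picture of the check described above, here is a minimal sketch of a PutObject-based liveness probe. It is not the code from this PR: the `Uploader` interface, the `CheckAlive` function, the object name `walg_probe_object`, and the in-memory test stores are all hypothetical names chosen for illustration. Only the defaults (1 MiB, 30 s), the pseudo-random payload, and the fixed object name come from the description.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"math/rand"
	"time"
)

// Uploader is a hypothetical stand-in for a storage client, not WAL-G's real interface.
type Uploader interface {
	PutObject(ctx context.Context, name string, content io.Reader) error
}

const (
	defaultProbeSize    = 1 << 20             // 1 MiB, the default stated in the PR description
	defaultProbeTimeout = 30 * time.Second    // 30 s, the default stated in the PR description
	probeObjectName     = "walg_probe_object" // hypothetical fixed name, overwritten on every check
)

// CheckAlive uploads size pseudo-random bytes under a fixed name and
// reports an error if the upload fails or exceeds the timeout.
func CheckAlive(u Uploader, size int, timeout time.Duration) error {
	payload := make([]byte, size)
	rand.New(rand.NewSource(time.Now().UnixNano())).Read(payload)

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Buffered channel: the goroutine can finish even if we stop waiting.
	done := make(chan error, 1)
	go func() { done <- u.PutObject(ctx, probeObjectName, bytes.NewReader(payload)) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// memStore is an in-memory storage used only for this demonstration.
type memStore struct{ objects map[string][]byte }

func (m *memStore) PutObject(_ context.Context, name string, content io.Reader) error {
	data, err := io.ReadAll(content)
	if err != nil {
		return err
	}
	m.objects[name] = data
	return nil
}

// readOnlyStore simulates a storage that rejects writes.
type readOnlyStore struct{}

func (readOnlyStore) PutObject(context.Context, string, io.Reader) error {
	return errors.New("storage is read-only")
}

// stalledStore simulates a storage that never completes an upload.
type stalledStore struct{}

func (stalledStore) PutObject(ctx context.Context, _ string, _ io.Reader) error {
	<-ctx.Done()
	return ctx.Err()
}

// demoProbe runs the probe against the three simulated storages.
func demoProbe() (healthyErr error, objectCount int, readOnlyErr error, timedOut bool) {
	healthy := &memStore{objects: map[string][]byte{}}
	for i := 0; i < 3 && healthyErr == nil; i++ {
		healthyErr = CheckAlive(healthy, 4096, time.Second)
	}
	readOnlyErr = CheckAlive(readOnlyStore{}, 4096, time.Second)
	stalledErr := CheckAlive(stalledStore{}, 4096, 50*time.Millisecond)
	return healthyErr, len(healthy.objects), readOnlyErr, errors.Is(stalledErr, context.DeadlineExceeded)
}

func main() {
	healthyErr, count, readOnlyErr, timedOut := demoProbe()
	fmt.Println("healthy:", healthyErr, "objects stored:", count)
	fmt.Println("read-only:", readOnlyErr)
	fmt.Println("stalled storage timed out:", timedOut)
	fmt.Println("defaults:", defaultProbeSize, "bytes,", defaultProbeTimeout)
}
```

Because every check writes to the same name, three consecutive probes against the healthy store leave a single object behind, while the read-only and stalled stores are both reported as failed, which a listing-only check could miss.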

ne2pit requested a review from a team as a code owner September 5, 2023 22:14
Member

usernamedt left a comment

LGTM, let's wait for the context PR to be merged. One last thing: please add a note to the docs about the new setting.

Contributor Author

ne2pit commented Sep 7, 2023

I restored the ListFolder check as we discussed. That changes our earlier plan to combine this with the context PR. Adding ListFolderWithContext is trivial for most storages except SSH: at first sight, it looks impossible with the current SFTP library. What do you think? Should we still wait for the context PR to be merged?

Member

usernamedt commented Sep 8, 2023

> I restored the ListFolder check as we discussed. That changes our earlier plan to combine this with the context PR. Adding ListFolderWithContext is trivial for most storages except SSH: at first sight, it looks impossible with the current SFTP library. What do you think? Should we still wait for the context PR to be merged?

We still have WALG_FAILOVER_STORAGES_CHECK_TIMEOUT, so I guess it is OK to simply call ListObjects(). Since it is a simple API call and we don't upload anything in it, it is unlikely to hang indefinitely.
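
To illustrate the point about the timeout, here is a rough sketch of how a listing call that does not accept a context (as with the SFTP case mentioned above) could still be bounded by an outer timeout. This is an assumption about the general technique, not WAL-G's implementation: `callWithTimeout`, `ErrCheckTimeout`, and the simulated listing functions are hypothetical, and how WALG_FAILOVER_STORAGES_CHECK_TIMEOUT is actually read and applied is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrCheckTimeout is a hypothetical sentinel error for an expired check.
var ErrCheckTimeout = errors.New("storage check timed out")

// callWithTimeout runs a blocking call in a goroutine and stops waiting
// after timeout. The call itself is not cancelled: if it never returns,
// its goroutine stays alive, which is the cost of lacking context support.
func callWithTimeout(timeout time.Duration, call func() error) error {
	// Buffered so the goroutine can deliver its result after we gave up.
	done := make(chan error, 1)
	go func() { done <- call() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return ErrCheckTimeout
	}
}

// demoListCheck simulates a fast listing call and a slow one.
func demoListCheck() (fastErr, slowErr error) {
	fastList := func() error { return nil }
	slowList := func() error {
		time.Sleep(200 * time.Millisecond) // simulated slow storage
		return nil
	}
	fastErr = callWithTimeout(50*time.Millisecond, fastList)
	slowErr = callWithTimeout(50*time.Millisecond, slowList)
	return fastErr, slowErr
}

func main() {
	fastErr, slowErr := demoListCheck()
	fmt.Println("fast listing:", fastErr)
	fmt.Println("slow listing:", slowErr)
}
```

The trade-off in this sketch is that the caller is unblocked on time, but the underlying call keeps running until the storage responds, so it suits cheap calls that are unlikely to hang.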

usernamedt merged commit dd3a0d0 into wal-g:master Sep 26, 2023
76 of 77 checks passed
ne2pit deleted the storage_fail_detection branch September 27, 2023 14:59