NSFS | when FS is unavailable requests fail immediately with InternalError exhausting client attempts #8039

Open
guymguym opened this issue May 10, 2024 · 0 comments
Labels: NS-FS, Type:Enhancement
Environment info

  • NooBaa Version: 5.15
  • Platform: NA

Actual behavior

  1. When a filesystem is temporarily unavailable, or during failover to another node, system calls may return errors such as ESTALE or EIO until it recovers - GPFS is one example.
  2. This can happen when the FS daemon is down, or when some other internal component of the FS or device is not responding.
  3. It is not always possible to accurately tell whether this state is transient or long term, but the assumption is that it will be resolved sooner or later, whether by automated failover to another node, automated recovery, or manual intervention.
  4. NSFS does not identify these errors specifically; it treats them as general internal errors and therefore returns InternalError to the S3 client immediately.
  5. S3 clients vary in their retry options and configuration, but most clients make a few retry attempts (3-5) with some exponential backoff.
  6. For some S3 clients the retry strategy is easy to configure (e.g. the aws cli - see the example after this list), but for others it is not, and in any case it shifts responsibility onto the application to adapt to the storage failure modes, which is error prone.
  7. Even if the filesystem is able to recover or fail over, S3 clients might give up once their retries are exhausted.
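
For illustration, the aws cli (v2) exposes its retry behavior through the standard configuration; the values below are arbitrary:

```
# tune retry behavior for the aws cli (written to ~/.aws/config)
aws configure set retry_mode standard
aws configure set max_attempts 5
```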

Expected behavior

  1. S3 endpoints should be able to hold client requests open (as long as the request has not timed out or been aborted by the client).
  2. By holding the request, the client has a better chance of riding out the temporary unavailability.
  3. NSFS filesystem calls should be wrapped with a retry_temporary_fs_errors function that detects temporary errors, keeps the request context alive, and retries the FS call with some backoff (see the sketch after this list).
  4. An important point is that this retry must also detect whether the client request was aborted.
  5. We would need config options to control the maximum retry time/count and to enable/disable this behavior, so it can fit different systems.
  6. We also need to differentiate temporary unavailability errors from other errors; for example, ESTALE does not always represent a recoverable error, so this detection may differ between FS backends.
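
A minimal sketch of what such a wrapper could look like, assuming a Node.js endpoint where the request object exposes an aborted flag; the function name follows item 3, while the config knobs and the temporary-error list are illustrative assumptions, not existing NooBaa code:

```js
// Hypothetical sketch - not existing NooBaa code.
const TEMPORARY_FS_ERRORS = new Set(['ESTALE', 'EIO']); // may differ per FS backend

async function retry_temporary_fs_errors(req, fs_call, {
    max_retries = 10,       // assumed config: maximum retry count
    initial_delay_ms = 100, // assumed config: first backoff delay
    max_delay_ms = 5000,    // assumed config: backoff ceiling
} = {}) {
    let delay = initial_delay_ms;
    for (let attempt = 0; ; attempt += 1) {
        try {
            return await fs_call();
        } catch (err) {
            const temporary = TEMPORARY_FS_ERRORS.has(err.code);
            // give up if the error is not temporary, retries are exhausted,
            // or the client has already aborted the request
            if (!temporary || attempt >= max_retries || req.aborted) throw err;
            await new Promise(resolve => setTimeout(resolve, delay));
            delay = Math.min(delay * 2, max_delay_ms); // exponential backoff
        }
    }
}

// Illustrative usage - wrapping a plain fs call (NSFS would wrap its own FS layer):
// const stat = await retry_temporary_fs_errors(req, () => fs.promises.stat(file_path));
```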