NSFS | when FS is unavailable requests fail immediately with InternalError exhausting client attempts #8039

Open
guymguym opened this issue May 10, 2024 · 0 comments
Labels: NS-FS, Type:Enhancement
Environment info

  • NooBaa Version: 5.15
  • Platform: NA

Actual behavior

  1. When a filesystem is temporarily unavailable, or during failover to another node, system calls may return errors such as ESTALE or EIO until it recovers - GPFS is one example.
  2. This can happen when the FS daemon is down, or when some other internal component of the FS or device is not responding.
  3. It is not always possible to accurately tell whether this state is transient or long term, but the assumption is that it will be resolved sooner or later, whether by automated failover to another node, automated recovery, or manual intervention.
  4. NSFS does not identify these errors specifically; it treats them as general internal errors and therefore returns InternalError to the S3 client immediately.
  5. S3 clients vary in their retry options and configuration, but most clients make a few retry attempts (3-5) with some exponential backoff.
  6. For some S3 clients the retry strategy is easy to configure (e.g. the aws cli - see the example after this list), but for others it is not, and in any case it shifts responsibility onto the application to adapt to the storage failure modes, which is error prone.
  7. Even if the filesystem is able to recover or fail over, S3 clients might give up once their retries are exhausted.
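
For illustration, the aws cli (v2) exposes its retry behavior through the standard configuration; the values below are arbitrary:

```
# tune retry behavior for the aws cli (written to ~/.aws/config)
aws configure set retry_mode standard
aws configure set max_attempts 5
```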

Expected behavior

  1. S3 endpoints should be able to hold client requests open (as long as the request has not timed out or been aborted by the client).
  2. By holding the request, the client has a better chance of riding out the temporary unavailability.
  3. NSFS filesystem calls should be wrapped with a retry_temporary_fs_errors function that detects temporary errors, keeps the request context alive, and retries the FS call with some backoff (see the sketch after this list).
  4. An important point is that this retry must also detect whether the client request was aborted.
  5. We would need config options to control the maximum retry time/count and to enable/disable this behavior, so it can fit different systems.
  6. We also need to differentiate temporary unavailability errors from other errors; for example, ESTALE does not always represent a recoverable error, so this detection may differ between FS backends.
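
A minimal sketch of what such a wrapper could look like, assuming a Node.js endpoint where the request object exposes an aborted flag; the function name follows item 3, while the config knobs and the temporary-error list are illustrative assumptions, not existing NooBaa code:

```js
// Hypothetical sketch - not existing NooBaa code.
const TEMPORARY_FS_ERRORS = new Set(['ESTALE', 'EIO']); // may differ per FS backend

async function retry_temporary_fs_errors(req, fs_call, {
    max_retries = 10,       // assumed config: maximum retry count
    initial_delay_ms = 100, // assumed config: first backoff delay
    max_delay_ms = 5000,    // assumed config: backoff ceiling
} = {}) {
    let delay = initial_delay_ms;
    for (let attempt = 0; ; attempt += 1) {
        try {
            return await fs_call();
        } catch (err) {
            const temporary = TEMPORARY_FS_ERRORS.has(err.code);
            // give up if the error is not temporary, retries are exhausted,
            // or the client has already aborted the request
            if (!temporary || attempt >= max_retries || req.aborted) throw err;
            await new Promise(resolve => setTimeout(resolve, delay));
            delay = Math.min(delay * 2, max_delay_ms); // exponential backoff
        }
    }
}

// Illustrative usage - wrapping a plain fs call (NSFS would wrap its own FS layer):
// const stat = await retry_temporary_fs_errors(req, () => fs.promises.stat(file_path));
```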