FileExpired exception when reading parquet from a Minio bucket using Dask #11044
Thanks for your bug report. Without a reproducer, we'll likely have a hard time helping out. I noticed that the traceback you posted is incomplete. Can you please check whether something is missing? Maybe this points to the problem. Particularly the |
Yes, I know it is difficult to reproduce this error without a working example; that is also my main problem when trying to fix it. I posted the complete traceback at the end and only modified sensitive information such as the filename. The TypeError is the last exception raised, and it terminates exactly like this |
I was able to reproduce the error using the following script
and after about an hour the exception was triggered. I was also able to retrieve the complete traceback:
For this example I used Windows. If there is any other information that would be useful, let me know. |
@martindurant are you familiar with this kind of error? |
This means that the entry for the file contained in the directory listings and held in the file-like instance no longer matches the remote store, because the file has been overwritten. This is intentional: an open file should become invalid if it changes while being read. I'm not sure of any specifics of minio, but since this appears to be a time-related effect, you may wish to add |
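The mechanism described above can be illustrated with a small, self-contained sketch. Nothing here is s3fs itself: the FileExpired class and the md5-based "ETag" are local stand-ins for what s3fs does internally (record the object's ETag at open time, and refuse to read if the remote object has changed since):

```python
import hashlib

class FileExpired(Exception):
    """Local stand-in for the FileExpired exception raised by s3fs."""

class EtagCheckedReader:
    """Mimics the behaviour: the ETag captured at open() must still match at read()."""

    def __init__(self, store, key):
        self.store, self.key = store, key
        # "ETag" captured when the file is opened
        self.etag = hashlib.md5(store[key]).hexdigest()

    def read(self):
        current = hashlib.md5(self.store[self.key]).hexdigest()
        if current != self.etag:
            # The remote object was overwritten since open(): the open handle is invalid
            raise FileExpired(f"ETag changed for {self.key}")
        return self.store[self.key]

store = {"data.parquet": b"v1"}      # toy in-memory "bucket"
reader = EtagCheckedReader(store, "data.parquet")
first = reader.read()                # ok: object unchanged
store["data.parquet"] = b"v2"        # object overwritten mid-read, as every 15 minutes here
try:
    reader.read()
except FileExpired:
    expired = True
```

This is why the error only appears sometimes: it requires a read to be in flight exactly when the 15-minute overwrite lands on that particular object.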
Thank you for your answer and sorry for the delay. I cannot understand why this happens only sometimes: files are overwritten every 15 minutes, but not all of them raise this exception. I will try with |
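One way to act on the suggestion is to tune the directory-listings cache through the storage_options that Dask forwards to s3fs. This is a hedged sketch: the option names use_listings_cache, listings_expiry_time and skip_instance_cache come from fsspec's filesystem caching layer, so verify them against your installed fsspec/s3fs versions; the credentials and endpoint below are placeholders:

```python
# Options assumed to be forwarded to s3fs/fsspec via
# dd.read_parquet(..., storage_options=storage_options)
storage_options = {
    "key": "MINIO_ACCESS_KEY",       # placeholder credentials
    "secret": "MINIO_SECRET_KEY",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},  # assumed Minio endpoint
    "use_listings_cache": False,     # never reuse stale directory listings
    # alternatively, keep the cache but let entries expire quickly:
    # "listings_expiry_time": 30,    # seconds
    "skip_instance_cache": True,     # don't reuse a cached filesystem instance
}
```

Disabling the listings cache means every open re-fetches the current ETag, at the cost of extra listing requests per 15-minute cycle.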
Describe the issue:
I have a list of dataframes in a Minio bucket that are updated every 15 minutes. My script runs inside a Docker container in a loop; every 15 minutes a list of futures is created to read and preprocess every dataframe in the bucket. When computing the result, it sometimes happens that the following exception is triggered:
and triggers TypeError: __init__() missing 1 required positional argument: 'e_tag'. Even catching the exception does not solve the problem, since it is triggered again during the next iteration. I checked on Minio and the ETag indeed does not correspond to the one in the exception, but I do not know how to solve this problem. The code to read data is this:

Minimal Complete Verifiable Example:
Providing a verifiable example is difficult, since the code runs on Docker and is the result of various interacting scripts. I tried to replicate it by running a simple script outside of Docker, but the problem does not appear. This is the script I used; it is similar to what the original script does.
Anything else we need to know?:
I have tried using the invalidate_cache() function of s3fs and passing ignore_metadata_file=True when reading data, but it didn't work. Catching the exception works, but the problem is not solved during the following iteration. Here is the complete traceback if you find it useful:
Environment: