
Content corruption when file is open during overwrite on bucket #2413

Open
hbs opened this issue Feb 14, 2024 · 13 comments

Comments

@hbs

hbs commented Feb 14, 2024

Additional Information

Version of s3fs being used (s3fs --version)

V1.93

Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse or dpkg -s fuse)

2.9.9-3

Kernel information (uname -r)

Linux sl911168 5.4.0-171-generic #189-Ubuntu SMP Fri Jan 5 14:23:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

GNU/Linux Distribution, if applicable (cat /etc/os-release)

Ubuntu

How to run s3fs, if applicable

s3fs MOUNTPOINT -o bucket=BUCKET,_netdev,no_check_certificate,ro,passwd_file=PATH/TO/CREDS,use_path_request_style,url=ENDPOINT,umask=0022,allow_other,uid=501,gid=501,dev,suid,kernel_cache,max_background=1000,max_stat_cache_size=100000,parallel_count=30,multireq_max=30,use_cache=/data/s3fs.cache,ensure_diskfree=500000

s3fs syslog messages (grep s3fs /var/log/syslog, journalctl | grep s3fs, or s3fs outputs)

N/A

Details about issue

The following sequence allowed us to reproduce the issue most of the time:

  • Create an object FOO on the bucket
  • Mount the bucket using the above command
  • Open FOO using vi on the machine where the bucket is mounted
  • Overwrite the object FOO on the bucket with new content
  • Perform multiple cat FOO on the machine where the bucket is mounted
  • Close vi
  • Perform multiple cat FOO again

Most of the time, the content of object FOO as read through the mount point is not updated.

Doing the same without launching vi, i.e. without having the file open at the time its content is overwritten, leads to the correct result, with the later cat FOO returning the new content.
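
Concretely, with placeholder names (MOUNTPOINT, BUCKET) and s3cmd standing in for whatever tool overwrites the object directly on the bucket, the sequence looks like this:

s3cmd put FOO.v1 s3://BUCKET/FOO   # create the object with its initial content
vi MOUNTPOINT/FOO                  # open the file via the mount point and keep it open
s3cmd put FOO.v2 s3://BUCKET/FOO   # overwrite the object directly on the bucket, bypassing s3fs
cat MOUNTPOINT/FOO                 # run several times while vi is still open
# quit vi, then:
cat MOUNTPOINT/FOO                 # run several times again; most of the time the old content is still returned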

@gaul
Member

gaul commented Feb 15, 2024

Could you share more about the expected behavior? Without file locking, I expect uncoordinated writers to make arbitrary changes to the file, which might appear to be corruption. s3fs -o use_cache will only add to this confusion.

That said, s3fs could reduce, but not eliminate, the appearance of corruption by using GetObject with the If-Match header and UploadPartCopy with the x-amz-copy-source-if-match header, so that it operates only if the ETag still matches. This would allow s3fs to do the right thing when vi replaces a file via rename, so that cat could return an error instead of showing part of the first object and part of the second.
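
As a rough illustration of the conditional-read idea (this is not what s3fs currently does; the endpoint, bucket, and ETag below are placeholders, and request signing is omitted):

# Fetch the object only if its ETag still matches the one recorded when it was first opened.
curl -f -H 'If-Match: "d41d8cd98f00b204e9800998ecf8427e"' https://ENDPOINT/BUCKET/FOO -o FOO
# If the object has been replaced in the meantime, the server answers 412 Precondition Failed
# instead of serving a mix of old and new data.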

@hbs
Author

hbs commented Feb 15, 2024

I've probably done a poor job of explaining.

In the sequence above, the step "Overwrite the object FOO on the bucket with new content" is not performed using vi but with s3cmd, which pushes the new content onto the bucket; vi is simply opened and closed, and no file modification is performed via the mount point.

@ggtakec
Member

ggtakec commented Feb 19, 2024

When you open an object mounted with s3fs using vi, the following behavior occurs:

First, s3fs downloads the contents of the object and saves them to a local file, and the contents of that file are then read by the vi process.

When s3fs is started with the use_cache option and another process reads the same path, it is served the contents of that cache.
In other words, content uploaded by other S3 tools cannot be read.

However, if the use_cache option is not specified, the updated file contents can be read, because they are downloaded from the server each time other processes read them.

Note that if the file is small, the updated content will be read instead of the cache, regardless of the use_cache option.
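
One way to check which content the mount is serving compared to the bucket (paths are placeholders; s3cmd is just an example of a tool configured for the same bucket):

md5sum MOUNTPOINT/FOO                              # content as seen through the s3fs mount
s3cmd get --force s3://BUCKET/FOO /tmp/FOO.direct  # content fetched directly from the bucket
md5sum /tmp/FOO.direct
# If the two checksums differ while nothing is writing to the object, the mount is serving stale cache data.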

@hbs
Author

hbs commented Feb 19, 2024

I raised the issue because the behavior differs depending on whether or not the file is open in vi when the new content is pushed to the bucket, hence I think there is indeed an issue somewhere.

@ggtakec
Member

ggtakec commented Feb 19, 2024

@hbs Thanks for your quick reply.
I have not yet been able to reproduce this problem.
Even when use_cache is used, updated files can be read while vi is running.

  • You are using the kernel_cache option; can you check again with this option removed?
  • And could you please let me know the result of adding the enable_content_md5 option? (Both test commands are sketched below.)

I'm interested in these results. (By the way, I haven't seen the problem regardless of these options.)
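
For reference, the two test mounts could look roughly like this, based on the command in the original report (most options elided for readability; keep your remaining options as they are):

# Test 1: same setup, but with the kernel_cache option removed
s3fs MOUNTPOINT -o bucket=BUCKET,ro,passwd_file=PATH/TO/CREDS,url=ENDPOINT,use_path_request_style,use_cache=/data/s3fs.cache
# Test 2: the same, with the enable_content_md5 option added on top
s3fs MOUNTPOINT -o bucket=BUCKET,ro,passwd_file=PATH/TO/CREDS,url=ENDPOINT,use_path_request_style,use_cache=/data/s3fs.cache,enable_content_md5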

@hbs
Author

hbs commented Feb 29, 2024

One current example of the issue has the following elements:

.stats file has the following content:

64013277:26689926
0:26689926:0:0

The file size is indeed 26689926 bytes. The sparse file in the cache contains only 0x00 bytes, which is not the actual content of the file. Reading the file from the mount point shows those 0x00s followed by some content that is not in the cache file, which means the original content is not fetched even though the .stats file seems to indicate that the content was not loaded (if I interpret the :0: on the second line correctly).
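
For what it's worth, the sparse, zero-filled state of the cache file can be confirmed with standard tools (the cache path below is an example derived from the use_cache setting in the mount command):

du -h /data/s3fs.cache/BUCKET/FOO                  # blocks actually allocated on disk
du -h --apparent-size /data/s3fs.cache/BUCKET/FOO  # nominal file size; a large gap means the file is mostly holes
hexdump -C /data/s3fs.cache/BUCKET/FOO | head      # an all-zero region collapses into a single '*' line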

@hbs
Author

hbs commented Mar 1, 2024

The issue encountered might be due to caching at the FUSE level. How would s3fs behave in terms of access to the cache if the direct_io option is passed to FUSE?

@hbs
Author

hbs commented Mar 5, 2024

Another oddity when the corruption happens: the stat file (under .bucket.stat) for a corrupted file has a single range covering the complete file with flags :0:1, even though the filesystem is mounted ro.

How can the stat file indicate that the file was modified when the filesystem is read-only?

@hbs
Author

hbs commented Mar 8, 2024

With the direct_io option, the cache corruption issue still arises, with files showing the zeroed-out content of the sparse file in the cache.

This seems somewhat similar to #715

@ggtakec
Member

ggtakec commented Mar 10, 2024

@hbs
(I'd like to let you know up front that I haven't been able to reproduce this problem yet, and that I don't fully understand what's at stake.)
Several similar issues have been reported, but they are difficult to reproduce, and it takes time to identify the cause.

I've been asked several questions, so I'll provide a series of answers below:

First, if you specify the direct_io option at startup, it is handled by FUSE (i.e., it is an option that tells FUSE not to cache file content).
It does not affect the cache files (files on the local disk) managed by s3fs.
s3fs does not open its own cache files (the file content and the cached-range state information) with direct I/O.

Next, regarding the cache information file under .<bucketname>.stat of s3fs: its content is loaded internally when the target file is opened and is not written back until the file is closed.
There may be a misunderstanding on this point.

Also, the file content cache created under <bucketname> is a sparse file that holds the downloaded ranges of the target file's content.
Any area that has not yet been downloaded remains in the HOLE state.

Then, when a file is opened and read, a portion (or all) of its content is downloaded from the S3 server and stored in the cache file.
If the file is written (modified), the changes are written to the cache file.
If the file was updated, it is uploaded to the S3 server when it is closed, flushed, or synced.

Because the cache file is used in this way, it is updated whenever content is downloaded, even when the filesystem is mounted in RO mode.

If possible, please provide detailed steps to reproduce the problem or to identify its cause.
It would also help the analysis if you could start s3fs with dbglevel=info or curldbg and point out the log entries that seem to show the problem.
Thanks in advance for your assistance.
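
For example, a debug-enabled invocation could look roughly like this (most options elided for readability; the logfile option is assumed to be available in the s3fs version in use, otherwise the output goes to syslog):

# dbglevel=info raises s3fs's own log verbosity; curldbg additionally logs libcurl activity
s3fs MOUNTPOINT -o bucket=BUCKET,ro,passwd_file=PATH/TO/CREDS,url=ENDPOINT,use_cache=/data/s3fs.cache,dbglevel=info,curldbg,logfile=/var/log/s3fs-debug.log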

@hbs
Author

hbs commented Mar 11, 2024

Hi, thanks for your comment. I'll try to describe in more detail what is occurring, so you can perhaps identify the code to look at.

The setup is a bucket mounted in RO mode on a server. The bucket contains tens of thousands of files. The application accessing those files may keep them open for a very long period of time.

The issue that arises is that sometimes the application is served content that includes ranges still in the HOLE state. This is confirmed simply by looking at the problematic file with hexdump -C. Reading the s3fs cache file shows the same content as the one retrieved via the mount point.

The application may be closed from time to time, either cleanly, i.e. with files being closed before shutdown, or abruptly, with no explicit file closing.

The s3fs cache is not cleaned at startup, as it contains several terabytes of data that would take quite some time to redownload, with a significant impact on the application's performance while the cache is being repopulated.

So in our setup, no files are ever modified (files could be modified on the bucket side when I initially filed the issue, but that possibility has since been removed, and we still experience the issue).

If I understand correctly what you wrote regarding the range files under .<bucketname>.stat, their content should only be considered correct once s3fs has been shut down cleanly and the in-memory range information has been flushed to disk.

Regarding the logs, given the amount of file access performed by the production application where the issue arises, I don't think I will be able to provide them, unless there is a way to rotate those logs once they reach a certain size, so we can limit the total amount of space they use.

@ggtakec
Member

ggtakec commented Mar 17, 2024

@hbs Thank you for the detailed explanation.
I understand that collecting and checking logs may be difficult.

The cache file and its stat file are implemented based on the following assumptions:

The cache file created by s3fs can contain HOLEs, but when a range inside a HOLE area is read (accessed), that area is downloaded from the S3 server, written into the HOLE, and the HOLE is filled.
The stat file of the cache file under .<bucketname>.stat is then updated accordingly when the file is closed.

Even if a file is read by multiple processes, the reads go through this cache file, and the same cache is shared and updated.
Even if one process leaves the file open and another process (or the same process) reads it, its contents are read through the same cache file.

When an uncached range (HOLE) is read, that range is downloaded from the S3 server, written to the cache file, and the HOLE is filled.

When s3fs is terminated (not forcefully), any open files are closed.
The cache file's stat information is also serialized when each file is closed.
This should correctly reflect the state of the cache file (information such as HOLEs) in the stat file.
Therefore, the cache file and its stat file left behind when s3fs terminates remain a matched pair.

After (re)starting s3fs, these cache files and their stat files are loaded and used again when you open the file.
This allows s3fs to know the HOLE areas of the cache file even after a restart, and the cached portions can continue to be read from the cache file.

When a file is opened, the stat of the object on the S3 server is compared with that of the cache file (mtime and file size) to determine whether the cache file is stale.
If they do not match, the cache file is discarded.

This is how the s3fs cache is designed and implemented.
If s3fs reads a HOLE range from the cache file instead of downloading it from the S3 server, that may be a problem in s3fs.

Unfortunately, I have not yet been able to reproduce the same phenomenon as this issue, so I am not able to understand the cause.
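
As a manual illustration of this staleness check (not s3fs code; s3cmd and the paths are placeholders, and whether the cache file's own mtime exactly mirrors the object's is an internal detail), the values involved can be inspected like this:

s3cmd info s3://BUCKET/FOO                         # size and last-modified time of the object on the server
stat MOUNTPOINT/FOO                                # size and mtime reported through the mount
stat --format='%s %y' /data/s3fs.cache/BUCKET/FOO  # apparent size and mtime of the local cache file
# Per the description above, a mismatch between the server's mtime/size and the cache file's should cause
# s3fs to discard the cache entry on the next open; if they match but the data read through the mount is
# still zero-filled, the cache itself is suspect.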

@beatstream69

beatstream69 commented Apr 3, 2024

It seems I'm experiencing the same issue. The S3 bucket is mounted via fstab in read-only mode. The files in S3 are not modified.

fstab config

dataset /mnt/dataset fuse.s3fs _netdev,allow_other,use_cache=/mnt/data-ssd/s3-cache,passwd_file=/root/.passwd-s3fs,use_path_request_style,url=https://s3.example.com,uid=1000,gid=1000 0 0

System and s3fs versions

Debian 12.5

Linux jupyter2 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Amazon Simple Storage Service File System V1.90 (commit:unknown) with GnuTLS(gcrypt)
fuse (2.9.9-6)

The corrupted file is filled with zeros; the content of its .stat file is:

38274300:191275258
0:191275258:0:1
