Can't create indexes from un-seekable fileobj's #95

Open
epicfaace opened this issue Aug 16, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@epicfaace
Contributor

epicfaace commented Aug 16, 2022

In theory, it should be possible to create indexes from un-seekable archive fileobjs (for example, by using peek instead of read to check file headers). I got this working by modifying an older version of ratarmount core here (https://github.com/codalab/codalab-worksheets/pull/4212/files#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbe). It would be nice to do the same for the current version of ratarmountcore, though the current version seems to rely much more heavily on seek.
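To illustrate the peek idea, here is a minimal sketch using only the Python standard library. It is not ratarmountcore code: `NonSeekableStream` and `looks_like_gzip` are hypothetical names standing in for a pipe-like source and a format check. The point is that `io.BufferedReader.peek` can inspect the magic bytes without consuming them, so a later reader still sees the full header.

```python
import gzip
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a pipe or socket: readable but not seekable."""

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        chunk = self._buffer.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(stream: io.BufferedReader) -> bool:
    """Check the gzip magic bytes without advancing the stream position."""
    return stream.peek(len(GZIP_MAGIC))[:len(GZIP_MAGIC)] == GZIP_MAGIC


payload = gzip.compress(b"hello world")
stream = io.BufferedReader(NonSeekableStream(payload))
is_gzip = looks_like_gzip(stream)
# The header bytes are still buffered, so the decompressor can read them next.
roundtrip = gzip.GzipFile(fileobj=stream, mode="rb").read()
```

The same pattern would presumably extend to other formats by comparing against their magic bytes, as long as the check fits within the reader's buffer.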

@mxmlnkn
Owner

mxmlnkn commented Aug 17, 2022

I don't understand the use case. Why do you need an index for seeking when you can't seek on the file?

@epicfaace
Contributor Author

@mxmlnkn Here's the use case: we can't seek during index creation but we can seek once the index is available.

CodaLab allows users to upload files to Azure Blob Storage. During the upload process, we want to 1) upload the file to Blob Storage and 2) create the index. Once the file is on Azure Blob Storage, users can then download particular parts of the file using the ratarmount index.

However, during the upload process we are often just streaming a directory as .tar.gz, so the fileobj is un-seekable. It would be nice to feed this stream directly into ratarmount so that the index is created at the same time. Right now, we 1) upload the file to Blob Storage and then 2) download the entire file again so that ratarmount can create the index, which is slow and inefficient. Ideally, we could do both at once and not have to re-download the entire file just to create the index.
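A rough sketch of the single-pass idea, using only the standard library rather than ratarmount: a hypothetical `TeeReader` forwards every byte it reads to a sink (standing in for the Blob Storage upload), while `tarfile` in its streaming mode `"r|gz"` walks the members and records their offsets. The offsets refer to the decompressed stream, and this sketch does not produce gzip seek points, which is the part indexed_gzip would have to supply.

```python
import io
import tarfile


class TeeReader:
    """Hypothetical wrapper: every byte read is also written to `sink`.

    It exposes only read(), so the consumer cannot seek on it.
    """

    def __init__(self, source, sink):
        self._source = source
        self._sink = sink

    def read(self, size=-1):
        data = self._source.read(size)
        self._sink.write(data)
        return data


def upload_and_index(source, sink):
    """Copy `source` to `sink` while collecting tar member offsets in one pass."""
    tee = TeeReader(source, sink)
    entries = []
    # "r|gz" is tarfile's streaming mode: it reads sequentially and never seeks.
    with tarfile.open(fileobj=tee, mode="r|gz") as archive:
        for member in archive:
            entries.append((member.name, member.offset_data, member.size))
    # Drain whatever tarfile left unread so that the copy is complete.
    while tee.read(64 * 1024):
        pass
    return entries


def make_tar_gz(files):
    """Build a small .tar.gz in memory to act as the upload stream."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


payload = make_tar_gz({"a.txt": b"alpha", "b.txt": b"bravo!"})
uploaded = io.BytesIO()  # stands in for the Blob Storage upload
entries = upload_and_index(io.BytesIO(payload), uploaded)
```

After the call, `uploaded` holds a byte-identical copy of the stream and `entries` lists each member's name, data offset, and size, all gathered without re-reading the archive.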

@mxmlnkn mxmlnkn added the enhancement New feature or request label Aug 17, 2022
@epicfaace
Contributor Author

Looks like we might need to resolve pauldmccarthy/indexed_gzip#102 first though...

@mxmlnkn
Owner

mxmlnkn commented Aug 17, 2022

> Looks like we might need to resolve pauldmccarthy/indexed_gzip#102 first though...

I was going to write something like that before you did ;).

And it's not only indexed_gzip. I want to swap indexed_gzip for pragzip sometime in the future, and it might have that issue too. The same goes for bz2 and zstd. I know that you probably only need gzip, but it feels like a bug if only very specific file formats are supported.

In the end, I agree that it might be useful, but it seems difficult to implement with all of ratarmount's features. It might be implemented as a separate "function" on top of ratarmountcore because you wouldn't even need FUSE for that. I imagine something like wget -O- remote.tar.gz | tee downloaded.tar.gz | ratarmount --index-file downloaded.tar.gz.index. Theoretically, it could be enough to detect that data is being piped to stdin in order to enter that alternate mode, but we might also trigger it explicitly with something akin to a --create-index-from-stream option. The ratarmount CLI is getting kind of clunky, and I want to redesign it into something akin to git, with subcommands where they make sense. I haven't thought about it more deeply yet; I have only categorized the options in the help output.
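As a sketch of what such a git-style interface might look like, the following uses `argparse` subparsers. Everything here is an assumption made for illustration: the `mount` and `index` subcommands and the `ratarmount-sketch` program name do not exist in ratarmount, and `--create-index-from-stream` is only the option name floated above. The helper `wants_stream_mode` shows both triggers mentioned: the explicit flag, and stdin not being a terminal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical git-style CLI; these subcommands are not part of ratarmount."""
    parser = argparse.ArgumentParser(prog="ratarmount-sketch")
    subcommands = parser.add_subparsers(dest="command", required=True)

    mount = subcommands.add_parser("mount", help="mount an archive via FUSE")
    mount.add_argument("archive")
    mount.add_argument("mount_point")

    index = subcommands.add_parser("index", help="only create the index file")
    index.add_argument("--index-file", required=True)
    index.add_argument(
        "--create-index-from-stream",
        action="store_true",
        help="read the archive sequentially from stdin instead of a file",
    )
    return parser


def wants_stream_mode(args: argparse.Namespace, stdin_is_tty: bool) -> bool:
    """Enter streaming mode when asked explicitly or when stdin is piped."""
    return args.command == "index" and (
        args.create_index_from_stream or not stdin_is_tty
    )


args = build_parser().parse_args(["index", "--index-file", "downloaded.tar.gz.index"])
# In a real program the second argument would come from sys.stdin.isatty().
stream_mode = wants_stream_mode(args, stdin_is_tty=False)
```

Passing the terminal check in as a parameter keeps the decision testable; an explicit flag is probably the safer trigger, since stdin can be a non-terminal for unrelated reasons (for example, when run from a service).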

Currently, I won't be able to work on this in a timely manner. I will review PRs though if you find the time to take a deeper look into it.

Edit: One ratarmount option that would be problematic for streaming support is --recursive, because it might try to seek back. But, as with unsupported file formats, ratarmount could check whether the alternate mode has been activated and print an error message.
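A minimal sketch of such a check, under stated assumptions: `check_stream_mode`, `StreamModeError`, and both tuples below are hypothetical, only --recursive is actually named in this thread, and the list of streamable suffixes is a placeholder rather than a statement about what ratarmount supports.

```python
# Assumption: only --recursive is named in the discussion; others may apply.
STREAM_INCOMPATIBLE_OPTIONS = ("recursive",)
# Assumption: placeholder list of formats that could be indexed sequentially.
STREAMABLE_SUFFIXES = (".tar", ".tar.gz", ".tgz")


class StreamModeError(ValueError):
    """Raised when an option or format cannot work without seeking."""


def check_stream_mode(options: dict, archive_name: str) -> None:
    """Reject option and format combinations that would need to seek back."""
    for name in STREAM_INCOMPATIBLE_OPTIONS:
        if options.get(name):
            raise StreamModeError(
                f"--{name} is not supported when creating an index from a stream"
            )
    if not archive_name.endswith(STREAMABLE_SUFFIXES):
        raise StreamModeError(
            f"{archive_name}: this format cannot be indexed from a stream"
        )


def describe(options: dict, archive_name: str) -> str:
    """Return 'ok' or the error message a CLI would print."""
    try:
        check_stream_mode(options, archive_name)
    except StreamModeError as error:
        return str(error)
    return "ok"


accepted = describe({"recursive": False}, "downloaded.tar.gz")
rejected = describe({"recursive": True}, "downloaded.tar.gz")
```

Failing early with a clear message, before any bytes are consumed from the stream, matters here because a non-seekable input cannot be rewound for a second attempt.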

@epicfaace
Contributor Author

> Currently, I won't be able to work on this in a timely manner. I will review PRs though if you find the time to take a deeper look into it.

That works!! Thanks. By the way pauldmccarthy/indexed_gzip#102 is now resolved.


2 participants