does gcsfuse support content-encoding: gzip? #671

Open
BenWang91 opened this issue May 3, 2022 · 4 comments
Labels
Eng-Backlog, feature request, P1

Comments

@BenWang91

I have a gz file on GCS with content-encoding: gzip, and this is the error I saw when I tried to gunzip it.

fuse_debug: 2022/04/18 23:06:18.538054 Op 0x00000038        connection.go:416] <- LookUpInode (parent 2, name "1650322527.log", PID 3172)
fuse_debug: 2022/04/18 23:06:18.538244 Op 0x00000038        connection.go:498] -> OK (inode 3)
fuse_debug: 2022/04/18 23:06:18.538331 Op 0x0000003a        connection.go:416] <- OpenFile (inode 3, PID 3172)
fuse_debug: 2022/04/18 23:06:18.538432 Op 0x0000003a        connection.go:498] -> OK ()
fuse_debug: 2022/04/18 23:06:18.538544 Op 0x0000003c        connection.go:416] <- ReadFile (inode 3, PID 3172, handle 3, offset 0, 4096 bytes)
gcs: 2022/04/18 23:06:18.538671 Req              0x5: <- Read("vector/1650322527.log", [0, 444))
gcs: 2022/04/18 23:06:18.575237 Req              0x5: -> Read error: not retrying Read("vector/1650322527.log", 1650322746957504): Received unexpected status code 200 instead of HTTP 206
2022/04/18 23:06:18.575310 ReadFile: input/output error, fh.reader.ReadAt: readFull: not retrying Read("vector/1650322527.log", 1650322746957504): Received unexpected status code 200 instead of HTTP 206
fuse_debug: 2022/04/18 23:06:18.575363 Op 0x0000003c        connection.go:500] -> Error: "input/output error"
fuse: 2022/04/18 23:06:18.575378 *fuseops.ReadFileOp error: input/output error
fuse_debug: 2022/04/18 23:06:18.575777 Op 0x0000003e        connection.go:416] <- FlushFile (inode 3, PID 3172)
fuse_debug: 2022/04/18 23:06:18.575940 Op 0x0000003e        connection.go:498] -> OK ()
fuse_debug: 2022/04/18 23:06:18.576093 Op 0x00000040        connection.go:416] <- ReleaseFileHandle (PID 0)
gcs: 2022/04/18 23:06:18.576174 Req              0x5: -> Read("vector/1650322527.log", [0, 444)) (37.542666ms): OK
fuse_debug: 2022/04/18 23:06:18.576207 Op 0x00000040        connection.go:498] -> OK ()

Looking at issue #165, it seems that gcsfuse did not intend to support this back in 2016. Is that still the decision today? Thanks!

avidullu added the feature request label May 5, 2022
@avidullu
Contributor

avidullu commented May 5, 2022

Thanks for the request. As you pointed out, there was a good reason a while back for gcsfuse not to support content encoding, but we can definitely investigate whether it is possible now. Please let us take a look and we'll post an update soon.

@xor-xor

xor-xor commented Jul 12, 2022

@avidullu any update on this? As a user of GCP's Vertex Pipelines, I'd really, really like to have an option to read compressed files on executors via locally mounted GCS. Actually, I was quite surprised that something like this doesn't work ;)

sethiay added the P1 label Jun 6, 2023
@bhack

bhack commented Jun 15, 2023

Same for zip. Currently they are text/plain; charset=utf-8 in the GCS metadata.

@marcoa6
Collaborator

marcoa6 commented Sep 27, 2023

With v1.1, Cloud Storage FUSE now supports reading back objects as gzip if the content-encoding: gzip metadata is set. The detailed behavior is documented under "File transcoding".

For the reasons already mentioned, Cloud Storage FUSE only supports reading the file back as gzip and does not do decompressive transcoding over the wire. Previously, Cloud Storage FUSE would just return an error and not even allow the file to be returned as gzip.
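
As a minimal sketch of what "reading the file back as gzip" means for a client: the bytes that come through the mount are still compressed, so the reader decompresses them itself. The helper name `read_gzip_text` and the mount point in the docstring are hypothetical; only Python's standard `gzip` module is used.

```python
import gzip


def read_gzip_text(path, encoding="utf-8"):
    """Return the decompressed text of a gzip file.

    `path` would be a file under a Cloud Storage FUSE mount (hypothetical
    example: /mnt/bucket/vector/1650322527.log). The mount hands back the
    compressed bytes as stored; decompression happens here, on the client.
    """
    with gzip.open(path, "rt", encoding=encoding) as f:
        return f.read()
```

The same call works on any local gzip file, so it can be tried without a mounted bucket.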

Attempting to use Cloud Storage FUSE to edit or modify objects with content-encoding: gzip can produce unpredictable behavior. Cloud Storage FUSE uploads the object content as is (without compressing it) while retaining content-encoding: gzip. If that content is not properly gzip-compressed, other clients such as gsutil might fail to read it from the server, because they employ decompressive transcoding while reading, which fails on improper gzip content.
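
One way to guard against this is to confirm that a file really is gzip before copying it over an object that carries content-encoding: gzip. The following is a sketch only; `is_valid_gzip` is a hypothetical helper, not part of Cloud Storage FUSE, and it checks the file locally using Python's standard library.

```python
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the file at `path` fully decompresses as gzip."""
    # Reject anything that does not start with the gzip magic bytes
    # (this also rejects an empty file).
    with open(path, "rb") as raw:
        if raw.read(2) != b"\x1f\x8b":
            return False
    try:
        # Decompress the whole stream so truncated or corrupt data is caught.
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False
```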

So, unzipping the file within the GCSfuse directory creates the unzipped files as new files in GCS; alternatively, the file can be unzipped to a local directory. If files are unzipped within a GCSfuse directory and then modified, they are written back raw and uncompressed, not as gzip. If the files need to be modified and written back as gzip, do this outside the GCSfuse mounted directory: gunzip to a local directory --> read/modify the files locally --> re-zip locally --> replace the old gzip file in the GCSfuse directory with the new gzip file.
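
The gunzip, modify, re-zip, replace steps can be sketched in Python as follows. This is an illustration under assumptions, not an official procedure: `rewrite_gzip_file` and its `transform` callback are hypothetical names, the whole decompressed file is assumed to fit in memory, and the final copy into the mounted directory is assumed to keep the object's content-encoding: gzip metadata as described above.

```python
import gzip
import os
import shutil
import tempfile


def rewrite_gzip_file(mounted_path, transform):
    """Modify a gzip file outside the mount, then replace the original.

    `mounted_path` is a gzip file in the mounted directory; `transform`
    takes the decompressed bytes and returns the new bytes.
    """
    # Work in a local temporary directory, not in the mounted directory.
    with tempfile.TemporaryDirectory() as workdir:
        # 1. gunzip to a local file
        plain = os.path.join(workdir, "plain")
        with gzip.open(mounted_path, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)

        # 2. read/modify locally
        with open(plain, "rb") as f:
            new_data = transform(f.read())

        # 3. re-gzip locally
        local_gz = os.path.join(workdir, "new.gz")
        with gzip.open(local_gz, "wb") as f:
            f.write(new_data)

        # 4. replace the old gzip file with the new, already-compressed one
        shutil.copyfile(local_gz, mounted_path)
```

For example, `rewrite_gzip_file(path, lambda data: data.replace(b"ERROR", b"WARN"))` would rewrite the log while keeping it gzip-compressed on disk.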


7 participants