Disable internal checksum validation in GCS client #96

Merged — william-silversmith merged 1 commit into master from disable_gcs_checksum on Sep 6, 2023

Conversation

@ranlu (Contributor) commented Sep 1, 2023

The Google Storage client has MD5 checksums enabled by default. Since CloudFiles does checksum validation itself, we should disable the checksum in the download_as_bytes call. This reduces overall time from 5s to 4s when downloading 31 files, 10-20 MB each.

For some reason, if I disable checksum validation in CloudFiles instead, there is no speed improvement.
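For reference, a minimal sketch of the change being described, using the google-cloud-storage API directly (the bucket and object names are hypothetical, and this is an illustration rather than the actual CloudFiles patch):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-bucket")   # hypothetical bucket
blob = bucket.blob("images/chunk_0.bin")   # hypothetical object

# By default the client computes and verifies an MD5 (or CRC32C) checksum
# on every download. Passing checksum=None skips that client-side check,
# leaving CloudFiles' own validation as the single checksum pass.
data = blob.download_as_bytes(checksum=None)
```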

CloudFiles does checksum validation itself, so disable checksum inside
the GCS client.
@william-silversmith (Contributor)

I wonder if this big difference is the Google storage client using the pure-Python crcmod instead of the compiled version. Either way, it doesn't matter, as it's duplicative work. Thanks for the PR; I'll test this.

@william-silversmith added the performance (Significantly affects time or space.) label on Sep 1, 2023
@william-silversmith (Contributor)

I looked into the GCS library, and we're using the same hash implementation (hashlib.md5). The only difference I can see is that they use a blocked reading strategy, so maybe that's slower for reasonably sized files?
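For illustration, the two reading strategies look roughly like this with hashlib.md5; the block size below is an assumption, not necessarily what the GCS library uses:

```python
import hashlib

def md5_one_shot(data: bytes) -> str:
    # Hash the whole buffer in a single call.
    return hashlib.md5(data).hexdigest()

def md5_blocked(data: bytes, block_size: int = 8192) -> str:
    # Feed the buffer to the hasher in fixed-size blocks, the way a
    # streaming download would. Same digest, but more per-call overhead.
    h = hashlib.md5()
    for offset in range(0, len(data), block_size):
        h.update(data[offset:offset + block_size])
    return h.hexdigest()

payload = b"x" * 10_000_000  # ~10 MB, comparable to the files in this PR
assert md5_one_shot(payload) == md5_blocked(payload)
```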

@william-silversmith (Contributor)

S3 probably has the same double-validation logic. I'm trying to figure out what to do about it.

@ranlu (Contributor, Author) commented Sep 1, 2023

I wonder if all or most of the vendors cloud-files supports perform checksum validation; if so, we could also disable the validation in cloud-files. But the speed penalty for GCS is a bit alarming.

@william-silversmith (Contributor)

I'm trying to remember why I did download validation. It might be that I wasn't sure whether the GCS and S3 libraries were doing it. For uploads, I believe you have to manually provide the MD5 (which we do).
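For context, providing the MD5 on upload with the GCS client can look like the sketch below; setting blob.md5_hash before uploading lets the service reject the object if the received bytes hash differently. The helper and names are hypothetical, not CloudFiles' actual code:

```python
import base64
import hashlib

from google.cloud import storage

def upload_with_md5(bucket: storage.Bucket, key: str, data: bytes) -> None:
    # GCS expects the MD5 as a base64-encoded digest. If the service
    # computes a different hash for the bytes it receives, the upload
    # fails, so corruption in transit is caught server-side.
    blob = bucket.blob(key)
    blob.md5_hash = base64.b64encode(hashlib.md5(data).digest()).decode("utf-8")
    blob.upload_from_string(data, content_type="application/octet-stream")
```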

@ranlu (Contributor, Author) commented Sep 1, 2023

Makes sense. I suppose the md5sums are needed to validate that the uploaded data is not corrupted.

@ranlu (Contributor, Author) commented Sep 1, 2023

BTW, we should disable checksums in the GCS client for range requests anyway. At the moment, when downloading sharded objects like skeletons, there are tons of errors like this if I enable logging:

2023-09-01 07:40:14 INFO     No MD5 checksum was returned from the service while downloading https://storage.googleapis.com/download/storage/v1/b/xxxxxxxx/o/skeletons_mip_1%2F0494.shard?alt=media
(which happens for composite objects), so client-side content integrity checking is not being performed.
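For context, a sharded read is a range request along these lines; the service returns no usable whole-object MD5 for such a response, so the client logs the message above and skips validation anyway. Disabling the checksum also silences the log (bucket name and byte offsets here are made up):

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-bucket").blob("skeletons_mip_1/0494.shard")

# Fetch only bytes [0, 1023] of the shard. With the default checksum
# setting, the client looks for an MD5 header, finds none, and emits the
# "No MD5 checksum was returned" log line. checksum=None avoids both the
# redundant work and the log spam.
fragment = blob.download_as_bytes(start=0, end=1023, checksum=None)
```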

@william-silversmith (Contributor)

I did some testing, and I see a decrease of about 1 second in user time when downloading 32 10 MB files with this PR. I got a similar improvement by removing CloudFiles' checksum code. I'll accept your PR now, since the range requests are causing inappropriate logging from the Google checksum code.
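The measurement described is roughly the following shape, assuming a test bucket laid out with 32 files of about 10 MB each; this is a sketch of the methodology, not the actual benchmark script:

```python
import time

from cloudfiles import CloudFiles

cf = CloudFiles("gs://example-bucket/test/")    # hypothetical test prefix

paths = [f"file_{i}.bin" for i in range(32)]    # assumed object names

start = time.perf_counter()
results = cf.get(paths)  # downloads all 32 objects in parallel
elapsed = time.perf_counter() - start
print(f"downloaded {len(results)} files in {elapsed:.2f}s")
```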

@william-silversmith merged commit 75ae21d into master on Sep 6, 2023
5 checks passed
@william-silversmith deleted the disable_gcs_checksum branch on Sep 6, 2023