-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable internal checksum validation in GCS client #96
Conversation
CloudFiles does checksum validation itself, so disable checksun inside the GCS client.
I wonder if this big difference is the google storage client using crcmod (pure python) instead of crcmod (compiled version). Whichever, it doesn't matter as it's duplicative work. Thanks for the PR, I'll test this. |
I looked into the GCS library, and we're using the same hash implementation (hashlib.md5). The only difference I can see is that they're using a blocked reading strategy so maybe for reasonably sized files this is slower? |
S3 probably has the same double validation logic. Trying to figure out what to do about it. |
I wonder if all or most the vendors cloud-files supports supports checksum validation, if so we can also disable the validation in cloud-files, but the speed penalty for GCS is a bit alarming. |
I'm trying to remember why I did download validation. It might be that I wasn't sure if GCS and s3 libraries were doing it. For uploads, I believe you have to manually provide the md5 (which we do). |
Make senses, I suppose the md5sum are needed to validate the updated data is not corrupted. |
BTW, we should disable checksum in GCS client for range request anyway. at the moment when downloading sharded objects like skeletons, there are tons of errors like this if I use logging:
|
I did some testing and I do see a decrease of about 1 second in user time for downloading 32 10MB files using this PR. I got a similar improvement by removing CloudFiles' checksum code. I'll just accept your PR now since the range requests are causing inappropriate logging from the Google checksum. |
Google storage client has md5 chucksum enabled by default. Since CloudFiles does checksum validation itself, we should disable checksum in the
download_as_bytes
call. This reduces overall time from 5s to 4s, when downloading 31 files, 10-20M each.For some reason it seems if I disable checksum validation in CloudFiles instead, there is no speed improvement.