-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PyCurl for Image Downloading on Add Documents #814
Draft
OwenPendrighElliott
wants to merge
4
commits into
mainline
Choose a base branch
from
owen/pycurl_image_download
base: mainline
Could not load branches
Branch not found: {{ refName }}
Could not load tags
Nothing to show
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
src/marqo/s2_inference/clip_utils.py
Outdated
raise UnidentifiedImageError( | ||
f"image url `{image_path}` is unreachable, perhaps due to timeout. " | ||
f"Timeout threshold is set to {timeout} seconds." | ||
f"\nConnection error type: `{e.__class__.__name__}`") | ||
f"\nConnection error type: `{e.__class__.__name__}`" | ||
f"\n{e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will need to remove f"\n{e}"
src/marqo/s2_inference/clip_utils.py
Outdated
@@ -2,7 +2,10 @@ | |||
from marqo.tensor_search.enums import ModelProperties, InferenceParams | |||
from marqo.tensor_search.models.private_models import ModelLocation, ModelAuth | |||
import validators | |||
import requests | |||
# import requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove commented import
OwenPendrighElliott
had a problem deploying
to
marqo-test-suite
April 21, 2024 13:36
— with
GitHub Actions
Failure
OwenPendrighElliott
had a problem deploying
to
marqo-test-suite
May 3, 2024 04:52
— with
GitHub Actions
Failure
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Improvement
Requests is used to download images and introduces a bottleneck in multimodal indexing.
In another piece of work I was experimenting with alternative to
requests
for https requests to get files. My testing showed thatpycurl
was about 3-4x faster than requests so I wanted to test if the same applied to image downloading in Marqo. It appears that it does hold true, even with a naive implementation. Hitting images in a US S3 bucket from Aus was 3-4x faster withpycurl
, examples in our docs are about 2-3x faster. Using our local image search demo as an example, running on a machine with an Nvidia 2080 SUPER using 1 worker and a batch size of 32:Below is a histogram of the
image_download.full_time
from telemetry for the first 100 batches using each implementation, the difference is notable - this pattern holds at 6 concurrent workers. There are other benchmarks which confirm pycurl is faster in many applications.Here is an example response from the multimodal example in our docs:
With requests (1446ms):
With PyCurl (563ms):
The current implementation I did is pretty basic - it simply instantiates the
PyCurl
object in each image download. I expect that we can could get even better performance using the I/O multiplexing and DNS caching. As far as I can tell the only reason this implementation is faster is becauselibcurl
is simply faster thanurllib3
, I tried disabling the streaming withrequests
as is currently done but that didn't change the speed at all. This adds a dependency forpycurl
andcertifi
(certifi
may or may not be needed but I had certificate issues on WSL without it), this also might increase memory usage because there is a moment when the raw image data and the PIL image are held in memory together although the raw data should be dropped from memory almost instantly as it immediately becomes out of scope. Can someone check the changes (maybe from a machine in the US) and confirm these results are not just specific to my machine?No
No
TODO