PyCurl for Image Downloading on Add Documents #814

OwenPendrighElliott · 2024-04-21T13:35:07Z

What kind of change does this PR introduce?

Improvement

What is the current behavior? (You can also link to an open issue here)

Requests is used to download images and introduces a bottleneck in multimodal indexing.

What is the new behavior (if this is a feature change)?

In another piece of work I was experimenting with alternative to requests for https requests to get files. My testing showed that pycurl was about 3-4x faster than requests so I wanted to test if the same applied to image downloading in Marqo. It appears that it does hold true, even with a naive implementation. Hitting images in a US S3 bucket from Aus was 3-4x faster with pycurl, examples in our docs are about 2-3x faster. Using our local image search demo as an example, running on a machine with an Nvidia 2080 SUPER using 1 worker and a batch size of 32:

The mainline branch of Marqo would take about 17 hours (9-10 seconds per batch of 32)
The updated pycurl implementation would take 7.5 hours (3-4 seconds per batch of 32)

Below is a histogram of the image_download.full_time from telemetry for the first 100 batches using each implementation, the difference is notable - this pattern holds at 6 concurrent workers. There are other benchmarks which confirm pycurl is faster in many applications.

Here is an example response from the multimodal example in our docs:

With requests (1446ms):

{'errors': False, 'processingTimeMs': 1445.500699999684, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '6424ff33-63e0-4104-ab25-8f1adea50ead'}, 
{'status': 200, '_id': '0b89345a-1dee-4050-9caa-e48b794a1662'}, {'status': 200, '_id': 'c7ccee6f-213f-4b7f-be2a-1e8e3402b5dd'}]}

With PyCurl (563ms):

{'errors': False, 'processingTimeMs': 562.6993000005314, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '3c70a84c-e68b-4547-b8ef-20cf2fab4cf9'}, 
{'status': 200, '_id': 'a8646b51-0640-4bb0-8661-95efc1023f0b'}, {'status': 200, '_id': 'ca44ae44-88d5-4eb8-8802-4b280e48c968'}]}

The current implementation I did is pretty basic - it simply instantiates the PyCurl object in each image download. I expect that we can could get even better performance using the I/O multiplexing and DNS caching. As far as I can tell the only reason this implementation is faster is because libcurl is simply faster than urllib3, I tried disabling the streaming with requests as is currently done but that didn't change the speed at all. This adds a dependency for pycurl and certifi (certifi may or may not be needed but I had certificate issues on WSL without it), this also might increase memory usage because there is a moment when the raw image data and the PIL image are held in memory together although the raw data should be dropped from memory almost instantly as it immediately becomes out of scope. Can someone check the changes (maybe from a machine in the US) and confirm these results are not just specific to my machine?

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

Have unit tests been run against this PR? (Has there also been any additional testing?)

No

TODO

check memory impact
test from other machines in other localities

Please check if the PR fulfills these requirements

The commit message follows our guidelines
Tests for the changes have been added (for bug fixes/features)
Docs have been added / updated (for bug fixes / features)

OwenPendrighElliott · 2024-04-21T13:36:10Z

src/marqo/s2_inference/clip_utils.py

            raise UnidentifiedImageError(
                f"image url `{image_path}` is unreachable, perhaps due to timeout. "
                f"Timeout threshold is set to {timeout} seconds."
-                f"\nConnection error type: `{e.__class__.__name__}`")
+                f"\nConnection error type: `{e.__class__.__name__}`"
+                f"\n{e}")


will need to remove f"\n{e}"

OwenPendrighElliott · 2024-04-21T13:36:30Z

src/marqo/s2_inference/clip_utils.py

@@ -2,7 +2,10 @@
 from marqo.tensor_search.enums import ModelProperties, InferenceParams
 from marqo.tensor_search.models.private_models import ModelLocation, ModelAuth
 import validators
-import requests
+# import requests


will remove commented import

basic pycurl implementation for image fetching

f48ddb2

OwenPendrighElliott requested a review from farshidz April 21, 2024 13:35

OwenPendrighElliott commented Apr 21, 2024

View reviewed changes

OwenPendrighElliott had a problem deploying to marqo-test-suite April 21, 2024 13:36 — with GitHub Actions Failure

OwenPendrighElliott added 3 commits May 3, 2024 13:44

Merge branch 'mainline' into owen/pycurl_image_download

d420af0

updating test mocks

5153f1c

minor cleanup on code

766ac92

OwenPendrighElliott had a problem deploying to marqo-test-suite May 3, 2024 04:52 — with GitHub Actions Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PyCurl for Image Downloading on Add Documents #814

PyCurl for Image Downloading on Add Documents #814

OwenPendrighElliott commented Apr 21, 2024

OwenPendrighElliott Apr 21, 2024

OwenPendrighElliott Apr 21, 2024

PyCurl for Image Downloading on Add Documents #814

Are you sure you want to change the base?

PyCurl for Image Downloading on Add Documents #814

Conversation

OwenPendrighElliott commented Apr 21, 2024

OwenPendrighElliott Apr 21, 2024

Choose a reason for hiding this comment

OwenPendrighElliott Apr 21, 2024

Choose a reason for hiding this comment