
Many images on page slows down request when using page.images.with_size #242

Open
martijndekuijper opened this issue Nov 6, 2018 · 4 comments


@martijndekuijper

Hey, when you request a page with many images (e.g. 80) and use page.images.with_size, the request is so slow that our application times out (>30s).

Is there a way to limit the number of images it calculates the size for? I'd love to be able to fetch the size of only the first 10 images, for example.

@jaimeiniesta
Owner

jaimeiniesta commented Nov 6, 2018

You can disable image downloading entirely if you want:

https://github.com/jaimeiniesta/metainspector#image-downloading

Downloading images can take a lot of time if there are many of them or they are slow to download. It's better to do this in a background job so that the server does not time out.

To improve this, I can think of:

a) Introduce a new option to limit the number of images to download. Or maybe we could change download_images to be an integer instead of a boolean: for example, instead of download_images: false we could have download_images: 0 or download_images: 10. By default we would download all the images, but if an integer is specified we would limit downloads to that many.

b) We could try downloading the images in parallel.

What do you think?
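A rough Ruby sketch of how options (a) and (b) could fit together. The method name, the integer-or-boolean download_images handling, and the fetch_size block are illustrative assumptions, not the gem's actual internals; fetch_size stands in for the per-image HTTP request (which MetaInspector could still do via something like the fastimage gem):

```ruby
# Hypothetical sketch of proposals (a) and (b) combined.
def images_with_size(image_urls, download_images: true, &fetch_size)
  limit =
    case download_images
    when false   then 0                  # current behavior: skip downloads
    when true    then image_urls.length  # current behavior: download all
    when Integer then download_images    # proposal (a): cap the downloads
    end
  # Proposal (b): fetch the sizes in parallel threads instead of sequentially.
  image_urls.take(limit)
            .map { |url| Thread.new { [url, fetch_size.call(url)] } }
            .map(&:value) # Thread#value joins and collects results in order
end

urls = (1..80).map { |i| "https://example.com/img#{i}.png" }
# Only the first 10 images would be fetched:
images_with_size(urls, download_images: 10) { |_url| [100, 100] }.length # => 10
```

This also shows why (a) and (b) are complementary: the limit bounds how much work there is, and the parallelism bounds how long the remaining work takes.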

@martijndekuijper
Author

Oh wow, this speeds things up drastically and fixed my issue for now! Thanks so much.

I feel this can indeed be improved even more with an option to limit the number of images to download. Loading in parallel might also fix the load time, but it would still download all the images, which is redundant if you don't need, say, the last 70 images. Right?

@jaimeiniesta
Owner

That's right: if you're downloading images to get better results for their dimensions, then this should definitely go into a background job. In fact, since any external request can take a long time, as a general rule all scraping should probably happen in background jobs.
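As a minimal plain-Ruby illustration of the background-job idea (in a real app this would be an ActiveJob/Sidekiq job rather than a bare Thread, and the pushed hash stands in for the real MetaInspector result):

```ruby
# Hypothetical sketch: move the slow scrape off the request path.
results = Queue.new # thread-safe queue from Ruby core

worker = Thread.new do
  # Stand-in for the slow MetaInspector scrape; in production this
  # would be a proper background job with retries and persistence.
  results << { url: 'https://example.com', image_count: 80 }
end

# The web request can respond immediately; the result is consumed
# later (e.g. shown on the next page load or pushed to the client).
worker.join
scraped = results.pop
```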

You're right that we could do a) and b), that is, let us specify the number of images to download, and also try to download them in parallel.

@ishields
Contributor

ishields commented Jun 13, 2020

I think the idea of making download_images an integer instead of a boolean is a good one too. I will make a PR. Alternatively, since there are more and more image options, maybe we just add a hash: image_options: { download: true, max_downloaded_images: 300 },
and include the other options I've added in https://github.com/metainspector/metainspector/pull/267 (fetch_all_image_meta and image_blacklist_words).
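A sketch of how the gem might validate and default such an image_options hash. The key names mirror the ones proposed in this thread; the normalization helper itself is hypothetical, not part of MetaInspector:

```ruby
# Hypothetical defaults for the proposed image_options hash.
DEFAULTS = {
  download: true,             # current download_images behavior
  max_downloaded_images: nil, # nil = no limit (this issue's proposal)
  fetch_all_image_meta: false,
  image_blacklist_words: []
}.freeze

# Reject unknown keys early, then merge user options over the defaults.
def normalize_image_options(opts = {})
  unknown = opts.keys - DEFAULTS.keys
  raise ArgumentError, "unknown image_options: #{unknown}" if unknown.any?
  DEFAULTS.merge(opts)
end

normalize_image_options(download: true, max_downloaded_images: 300)
```

Grouping the options this way keeps the top-level constructor signature stable as more image-related settings are added.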
