
Accelerate urls checking using concurrency (asyncio + aiohttp) #50

SuperKogito opened this issue Feb 5, 2021 · 6 comments

@SuperKogito
Member

URLs are checked in a loop that waits for each request's response sequentially, which becomes slow for huge websites.


[Diagram: sequential request processing]
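The sequential pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: `check_url` is a stand-in that simulates network latency with `time.sleep` instead of a real `requests.get`, so it runs without network access.

```python
import time

def check_url(url):
    # Stand-in for a real `requests.get(url)` call; sleeps to simulate latency.
    time.sleep(0.1)
    return url, 200

def check_all_serial(urls):
    # Each check waits for the previous one to finish, so total
    # runtime grows linearly with the number of URLs.
    return [check_url(u) for u in urls]

urls = [f"https://example.com/page{i}" for i in range(5)]
start = time.perf_counter()
results = check_all_serial(urls)
elapsed = time.perf_counter() - start  # roughly 5 * 0.1 s: the sum of all waits
```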

Alternatively, we can use concurrency to process requests and responses asynchronously and speed up the system.


[Diagram: concurrent request processing]

I already integrated this concept in my local repo using the asyncio and aiohttp libraries, and the results look promising. The speed difference is notable according to various blog posts (Python and fast HTTP clients, HTTP in Python: aiohttp vs. Requests, Making 1 million requests with python-aiohttp), and so far my tests confirm that.

[Benchmark plots]
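The concurrent approach can be sketched with `asyncio.gather`. To keep the example dependency-free and runnable, the stand-in coroutine below simulates aiohttp's `session.get` with `asyncio.sleep`; the structure, not the HTTP call, is the point.

```python
import asyncio
import time

async def check_url(url):
    # Stand-in for aiohttp's `async with session.get(url) as resp:`;
    # asyncio.sleep simulates network latency without a real request.
    await asyncio.sleep(0.1)
    return url, 200

async def check_all(urls):
    # All checks are in flight at once; total runtime is roughly the
    # duration of the slowest single check, not the sum of all of them.
    return await asyncio.gather(*(check_url(u) for u in urls))

urls = [f"https://example.com/page{i}" for i in range(20)]
start = time.perf_counter()
results = asyncio.run(check_all(urls))
elapsed = time.perf_counter() - start  # ~0.1 s rather than ~2 s serially
```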

The new libraries are slightly different from requests, so the following is true:

  • The requests' response format is different.
  • The raised exceptions differ from the previous ones.
  • The printed URL list can be somewhat out of order (asynchronicity).
  • Duplicate URLs need to be removed before checking, instead of adding URLs to a seen set during the loop.
  • A different timeout value is possibly needed.
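On the duplicate-removal point, one way to dedupe up front while preserving first-seen order is a sketch like this (not the actual implementation):

```python
def dedupe_urls(urls):
    # With concurrent checking we can no longer skip duplicates via a
    # growing `seen` set inside the loop; dedupe the list up front instead.
    # dict.fromkeys preserves first-seen order (guaranteed since Python 3.7).
    return list(dict.fromkeys(urls))
```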

I managed to almost replicate the features we have in the current version, but I will definitely need your feedback. Anyway, these differences bring me to my major question, @vsoch: do you think we should add this feature as an option, --accelerated-run, or replace the current implementation with it?

@SuperKogito added the enhancement (New feature or request) and discussion (Discussing features, implementations and enhancements) labels Feb 5, 2021
@SuperKogito SuperKogito self-assigned this Feb 5, 2021
@vsoch
Collaborator

vsoch commented Feb 5, 2021

I think this is a great idea! If there is some reason to keep the original implementation, then instead of adding --accelerated-run I would do the inverse and ask the user to specify --serial if they want the slower one (so faster is the default). The other detail is reporting: if the checks run async, the results would print fairly out of order, so it might make sense to still break the sets of URLs down by file (so each file still prints together). We could also do something similar to multiprocessing: make a queue, get the results (which include some key for the filename), and then control the order of the printing by way of that key. We would also want to understand which versions of Python we would no longer support by using asyncio.
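The per-file grouping idea could be sketched as follows: each task returns a `(filename, url, status)` tuple, everything runs concurrently, and results are grouped by the filename key before printing. The names and the stand-in `check` coroutine here are hypothetical, not the project's API.

```python
import asyncio
from collections import defaultdict

async def check(filename, url):
    # Stand-in for the real async HTTP check.
    await asyncio.sleep(0.01)
    return filename, url, 200

async def check_grouped(pairs):
    # Run every (file, url) check concurrently, then group the results
    # by filename so each file's URLs still print together.
    results = await asyncio.gather(*(check(f, u) for f, u in pairs))
    by_file = defaultdict(list)
    for filename, url, status in results:
        by_file[filename].append((url, status))
    return by_file

pairs = [("a.md", "https://x/1"), ("b.md", "https://y/1"), ("a.md", "https://x/2")]
grouped = asyncio.run(check_grouped(pairs))
for filename in sorted(grouped):
    print(filename, grouped[filename])
```

Since `asyncio.gather` returns results in the order the coroutines were submitted, each file's URL list also stays in a stable order even though the checks themselves interleave.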

Also, let's put in this change after the work I'm going to do this weekend to update the url arguments and spelling, just so it's easier to manage the different PRs.

@vsoch
Collaborator

vsoch commented Feb 5, 2021

Those plots are beautiful, by the way! If we keep both fast / slow we could include a page in the docs that shows this difference.

@SuperKogito
Member Author

Having the fast version as the default is a good idea. At the moment I kept the URLs separated by file, so one wouldn't notice the difference in the output unless --retry-count > 1
[Screenshots of the terminal output]

The Python versions question is important; I still need to check that. So far I have only tested my code on Python 3.7.
Of course, I intend to wait for your PR :) I just wanted to discuss this change and get your perspective on it, as it will affect the core of the library. But I think this is a step in the right direction, as it will give this library an advantage over other similar tools.

@vsoch
Collaborator

vsoch commented Feb 5, 2021

Oh yes, that's another thing to consider: if we do retries, we can't run them in parallel (they need to honor the timeout/delay).
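One way to reconcile retries with concurrency (a sketch with hypothetical names; `fake_request` stands in for the real aiohttp call): the retries for a single URL stay sequential and honor the delay, while different URLs still run concurrently via `gather`.

```python
import asyncio

calls = {}

async def fake_request(url):
    # Hypothetical stand-in for the real HTTP call; it fails the first
    # attempt and succeeds on the second so the retry path is exercised.
    calls[url] = calls.get(url, 0) + 1
    return calls[url] >= 2

async def check_with_retries(url, retry_count=3, delay=0.05):
    # Attempts for one URL run one after another, honoring the delay
    # between them; other URLs are not blocked while this one waits.
    for attempt in range(retry_count):
        if await fake_request(url):
            return url, "passed"
        if attempt < retry_count - 1:
            await asyncio.sleep(delay)
    return url, "failed"

async def main():
    return await asyncio.gather(
        check_with_retries("https://a"), check_with_retries("https://b"))

results = asyncio.run(main())
```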

@vsoch
Collaborator

vsoch commented Feb 8, 2021

okay @SuperKogito take it away! I won't do any more work / changes until you have had time to work on this. And no rush! I just started a new job so I'm a bit under 💧 😆

@SuperKogito
Member Author

I will try to do my best 😜 Congratulations on the new job 🎉
