
Accelerate urls checking using concurrency (asyncio + aiohttp) #50

SuperKogito opened this issue Feb 5, 2021 · 6 comments

@SuperKogito
Member

URLs are checked in a loop that waits for each request's response sequentially, which becomes slow for huge websites.


[Diagram: sequential request processing]
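The sequential pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: `check_url` is a stand-in that simulates network latency with `time.sleep` instead of a real `requests.get`, so it runs without network access.

```python
import time

def check_url(url):
    # Stand-in for a real `requests.get(url)` call; sleeps to simulate latency.
    time.sleep(0.1)
    return url, 200

def check_all_serial(urls):
    # Each check waits for the previous one to finish, so total
    # runtime grows linearly with the number of URLs.
    return [check_url(u) for u in urls]

urls = [f"https://example.com/page{i}" for i in range(5)]
start = time.perf_counter()
results = check_all_serial(urls)
elapsed = time.perf_counter() - start  # roughly 5 * 0.1 s: the sum of all waits
```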

Alternatively, we can use concurrency to process requests and responses asynchronously and speed up the system.


[Diagram: concurrent request processing]

I already integrated this concept in my local repo using the asyncio and aiohttp libraries, and the results look promising. The speed difference is notable according to various blog posts (Python and fast HTTP clients, HTTP in Python: aiohttp vs. Requests, Making 1 million requests with python-aiohttp), and so far my tests confirm that.

[Benchmark plots]
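The concurrent approach can be sketched with `asyncio.gather`. To keep the example dependency-free and runnable, the stand-in coroutine below simulates aiohttp's `session.get` with `asyncio.sleep`; the structure, not the HTTP call, is the point.

```python
import asyncio
import time

async def check_url(url):
    # Stand-in for aiohttp's `async with session.get(url) as resp:`;
    # asyncio.sleep simulates network latency without a real request.
    await asyncio.sleep(0.1)
    return url, 200

async def check_all(urls):
    # All checks are in flight at once; total runtime is roughly the
    # duration of the slowest single check, not the sum of all of them.
    return await asyncio.gather(*(check_url(u) for u in urls))

urls = [f"https://example.com/page{i}" for i in range(20)]
start = time.perf_counter()
results = asyncio.run(check_all(urls))
elapsed = time.perf_counter() - start  # ~0.1 s rather than ~2 s serially
```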

The new libraries are slightly different from requests, so the following is true:

  • The requests' response format is different.
  • The raised exceptions differ from the previous ones.
  • The printed URL list can be somewhat out of order (asynchronicity).
  • Duplicate URLs need to be removed before checking, instead of adding URLs to a seen set during the loop.
  • A different timeout value is possibly needed.
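On the duplicate-removal point, one way to dedupe up front while preserving first-seen order is a sketch like this (not the actual implementation):

```python
def dedupe_urls(urls):
    # With concurrent checking we can no longer skip duplicates via a
    # growing `seen` set inside the loop; dedupe the list up front instead.
    # dict.fromkeys preserves first-seen order (guaranteed since Python 3.7).
    return list(dict.fromkeys(urls))
```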

I managed to almost replicate the features we have in the current version, but I will definitely need your feedback. Anyway, these differences bring me to my major question, @vsoch: do you think we should add this feature as an option, --accelerated-run, or replace the current implementation with it?

@SuperKogito added the enhancement (New feature or request) and discussion (Discussing features, implementations and enhancements) labels Feb 5, 2021
@SuperKogito SuperKogito self-assigned this Feb 5, 2021
@vsoch
Collaborator

vsoch commented Feb 5, 2021

I think this is a great idea! If there is some reason to keep the original implementation, then instead of adding --accelerated-run I would do the inverse and ask the user to specify --serial if they want the slower one (so faster is the default). The other detail is reporting: if the checks run async, the results would print fairly out of order, so it might make sense to still break the sets of URLs down by file (so each file still prints together). We could also do something similar to multiprocessing: make a queue, get the results (which include some key for the filename), and then control the order of the printing by way of that key. We would also want to understand which versions of Python we would no longer support by using asyncio.
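The per-file grouping idea could be sketched as follows: each task returns a `(filename, url, status)` tuple, everything runs concurrently, and results are grouped by the filename key before printing. The names and the stand-in `check` coroutine here are hypothetical, not the project's API.

```python
import asyncio
from collections import defaultdict

async def check(filename, url):
    # Stand-in for the real async HTTP check.
    await asyncio.sleep(0.01)
    return filename, url, 200

async def check_grouped(pairs):
    # Run every (file, url) check concurrently, then group the results
    # by filename so each file's URLs still print together.
    results = await asyncio.gather(*(check(f, u) for f, u in pairs))
    by_file = defaultdict(list)
    for filename, url, status in results:
        by_file[filename].append((url, status))
    return by_file

pairs = [("a.md", "https://x/1"), ("b.md", "https://y/1"), ("a.md", "https://x/2")]
grouped = asyncio.run(check_grouped(pairs))
for filename in sorted(grouped):
    print(filename, grouped[filename])
```

Since `asyncio.gather` returns results in the order the coroutines were submitted, each file's URL list also stays in a stable order even though the checks themselves interleave.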

Also, let's put in this change after the work I'm going to do this weekend to update the url arguments and spelling, just so it's easier to manage the different PRs.

@vsoch
Collaborator

vsoch commented Feb 5, 2021

Those plots are beautiful, by the way! If we keep both fast / slow we could include a page in the docs that shows this difference.

@SuperKogito
Member Author

Having the fast version as the default is a good idea. At the moment I kept the URLs separated by file, so one wouldn't notice the difference in the output unless --retry-count > 1
[Screenshots of the terminal output]

The Python versions question is important; I still need to check that. So far I have only tested my code on Python 3.7.
Of course, I intend to wait for your PR :) I just wanted to discuss this change and get your perspective on it, as it will affect the core of the library. But I think this is a step in the right direction, as it will give this library an advantage over other similar tools.

@vsoch
Collaborator

vsoch commented Feb 5, 2021

Oh yes, that's another thing to consider: if we do retries, we can't run them in parallel (they need to honor the timeout/delay).
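One way to reconcile retries with concurrency (a sketch with hypothetical names; `fake_request` stands in for the real aiohttp call): the retries for a single URL stay sequential and honor the delay, while different URLs still run concurrently via `gather`.

```python
import asyncio

calls = {}

async def fake_request(url):
    # Hypothetical stand-in for the real HTTP call; it fails the first
    # attempt and succeeds on the second so the retry path is exercised.
    calls[url] = calls.get(url, 0) + 1
    return calls[url] >= 2

async def check_with_retries(url, retry_count=3, delay=0.05):
    # Attempts for one URL run one after another, honoring the delay
    # between them; other URLs are not blocked while this one waits.
    for attempt in range(retry_count):
        if await fake_request(url):
            return url, "passed"
        if attempt < retry_count - 1:
            await asyncio.sleep(delay)
    return url, "failed"

async def main():
    return await asyncio.gather(
        check_with_retries("https://a"), check_with_retries("https://b"))

results = asyncio.run(main())
```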

@vsoch
Collaborator

vsoch commented Feb 8, 2021

okay @SuperKogito take it away! I won't do any more work / changes until you have had time to work on this. And no rush! I just started a new job so I'm a bit under 💧 😆

@SuperKogito
Member Author

I will try to do my best 😜 Congratulations on the new job 🎉
