429 Client Error: Too Many Requests for url: https://archive.md/ #32

Open
catharsis71 opened this issue Sep 27, 2021 · 4 comments

@catharsis71

This has never worked for me; I always get a 429 error.

Running from within Python:

>>> import archiveis
>>> archive_url = archiveis.capture("http://www.example.com/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/share/virtualenvs/username-nbgasrwQ/lib/python3.8/site-packages/archiveis/api.py", line 39, in capture
    response.raise_for_status()
  File "/home/username/.local/share/virtualenvs/username-nbgasrwQ/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://archive.md/

Running from the command line:

$ archiveis https://google.com/
Traceback (most recent call last):
  File "./.local/share/virtualenvs/google-JUflU5ax/bin/archiveis", line 8, in <module>
    sys.exit(cli())
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/archiveis/api.py", line 106, in cli
    archive_url = capture(url, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/archiveis/api.py", line 39, in capture
    response.raise_for_status()
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://archive.md/
@adrelanos

Confirmed. Same here.

Except that the domain at the moment is archive.is.

@jmlingeman

Looking at the response that comes back, it appears the request is getting caught by a CAPTCHA. I'm unsure what the proper way to deal with this is.

@bpmcdevitt

I was getting 429s also.
What you can do to fix this is try some different user-agent strings. I switched mine to Firefox on Linux and it worked.

The command-line program comes with a -ua / --user-agent flag to change it:

archiveis -ua "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0" https://google.com
https://archive.md/wip/VBqdJ
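
The same thing should work from the Python API; a minimal sketch, assuming capture() accepts a user_agent keyword argument matching the CLI's -ua flag:

>>> import archiveis
>>> ua = "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"
>>> archive_url = archiveis.capture("http://www.example.com/", user_agent=ua)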

@tedmiston

> Looking at the response that comes back, it appears the request is getting caught by a CAPTCHA. I'm unsure what the proper way to deal with this is.

I think once you hit that CAPTCHA you're pretty much stuck. I've tried browser automation libraries like Playwright to do obvious things like clicking the CAPTCHA checkbox, to no avail.

As the comment above said, choosing a real user-agent probably helps.

Besides that, I think their rate limiting is largely IP-based, so if you are able to distribute your requests across IPs, that may help.
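
requests (which archiveis uses under the hood, per the tracebacks above) honors the standard proxy environment variables by default, so one way to route captures through different IPs without changing the library is something like this (the proxy address is hypothetical):

$ HTTPS_PROXY=http://proxy-host-1:8080 archiveis https://google.com/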

I have also noticed that they throttle me much more aggressively when I use Cloudflare 1.1.1.1 or Warp. This likely applies to other VPNs as well, but I haven't tested that personally.

It would be nice to wrap this project in a queuing system that treats submissions as jobs and retries them automatically, to make it more robust and act like a "service". I haven't seen anyone do exactly that yet.
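
A minimal sketch of that retry idea, assuming capture() accepts a user_agent keyword and raises a requests HTTPError on a 429 as in the tracebacks above (the user-agent pool and backoff schedule are just illustrative):

import random
import time

import archiveis
import requests

# Illustrative pool of real browser user-agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
]

def capture_with_retries(url, attempts=5, base_delay=30):
    """Retry a capture with exponential backoff, rotating user agents."""
    for attempt in range(attempts):
        try:
            return archiveis.capture(url, user_agent=random.choice(USER_AGENTS))
        except requests.exceptions.HTTPError as e:
            # Re-raise anything that isn't rate limiting.
            if e.response is None or e.response.status_code != 429:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still rate limited after %d attempts: %s" % (attempts, url))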

I have also noticed that, while infrequent, there are definitely some pages that seem to crash the archive.is archiver and never get archived successfully. I'm not sure if there's a way to report that to them.
