wayback_machine_downloader get lots Connection refused #264

Closed
intercoop opened this issue Nov 7, 2023 · 11 comments · May be fixed by #266

Comments

intercoop commented Nov 7, 2023

Last month wayback_machine_downloader ran fine, but since yesterday every domain I have tried returns many "Connection refused" errors.
The command looks like this: wayback_machine_downloader http://huzhan.com --concurrency 3 -t 20220525005404 -a
The output looks like this:
https://www.huzhan.com/code/goods377071.html -> websites/huzhan.com/code/goods377071.html (280/112619)
https://www.huzhan.com/serve/goods14529.html -> websites/huzhan.com/serve/goods14529.html (281/112619)
https://www.huzhan.com/serve/goods12899.html # Connection refused - connect(2)
https://www.huzhan.com/serve/goods12899.html -> websites/huzhan.com/serve/goods12899.html (282/112619)
https://www.huzhan.com/ishop42980/ # Connection refused - connect(2)
https://www.huzhan.com/ishop42980/ -> websites/huzhan.com/ishop42980/index.html (283/112619)
https://www.huzhan.com/code/goods421671.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods421671.html -> websites/huzhan.com/code/goods421671.html (284/112619)
https://www.huzhan.com/serve/goods15588.html # Connection refused - connect(2)
https://www.huzhan.com/serve/goods15588.html -> websites/huzhan.com/serve/goods15588.html (285/112619)
https://www.huzhan.com/serve/goods15287.html # Connection refused - connect(2)
https://www.huzhan.com/serve/goods15287.html -> websites/huzhan.com/serve/goods15287.html (286/112619)
https://www.huzhan.com/code/goods420832.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods420832.html -> websites/huzhan.com/code/goods420832.html (287/112619)
https://www.huzhan.com/ishop37725/ # Connection refused - connect(2)
https://www.huzhan.com/ishop37725/ -> websites/huzhan.com/ishop37725/index.html (288/112619)
https://www.huzhan.com/code/goods372252.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods372252.html -> websites/huzhan.com/code/goods372252.html (289/112619)
https://www.huzhan.com/code/goods418192.html # Connection refused - connect(2)
https://www.huzhan.com/ishop21789/ # Connection refused - connect(2)
https://www.huzhan.com/code/goods418192.html -> websites/huzhan.com/code/goods418192.html (290/112619)
https://www.huzhan.com/ishop21789/ -> websites/huzhan.com/ishop21789/index.html (291/112619)
https://www.huzhan.com/code/goods354759.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods354759.html -> websites/huzhan.com/code/goods354759.html (292/112619)
https://www.huzhan.com/code/goods421676.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods421676.html -> websites/huzhan.com/code/goods421676.html (293/112619)
https://www.huzhan.com/code/goods412576.html # Connection refused - connect(2)
https://www.huzhan.com/ishop40294/ # Connection refused - connect(2)
https://www.huzhan.com/code/goods412576.html -> websites/huzhan.com/code/goods412576.html (294/112619)
https://www.huzhan.com/ishop40294/ -> websites/huzhan.com/ishop40294/index.html (295/112619)
https://www.huzhan.com/ishop40283/ # Connection refused - connect(2)
https://www.huzhan.com/ishop40283/ -> websites/huzhan.com/ishop40283/index.html (296/112619)
https://www.huzhan.com/serve/goods15226.html # Connection refused - connect(2)
https://www.huzhan.com/serve/goods15226.html -> websites/huzhan.com/serve/goods15226.html (297/112619)
https://www.huzhan.com/ishop44505/ # Connection refused - connect(2)
https://www.huzhan.com/ishop44505/ -> websites/huzhan.com/ishop44505/index.html (298/112619)
https://www.huzhan.com/code/goods410194.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods410194.html -> websites/huzhan.com/code/goods410194.html (299/112619)
https://www.huzhan.com/ishop41272/ # Connection refused - connect(2)
https://www.huzhan.com/serve/goods15735.html # Connection refused - connect(2)
https://www.huzhan.com/ishop41272/ -> websites/huzhan.com/ishop41272/index.html (300/112619)
https://www.huzhan.com/serve/goods15735.html -> websites/huzhan.com/serve/goods15735.html (301/112619)
https://www.huzhan.com/code/goods420725.html # Connection refused - connect(2)
https://www.huzhan.com/code/goods420725.html -> websites/huzhan.com/code/goods420725.html (302/112619)
https://www.huzhan.com/ishop43261/ # Connection refused - connect(2)
https://www.huzhan.com/ishop43261/ -> websites/huzhan.com/ishop43261/index.html (303/112619)
https://www.huzhan.com/serve/goods15565.html # Connection refused - connect(2)
https://www.huzhan.com/serve/goods15565.html -> websites/huzhan.com/serve/goods15565.html (304/112619)
https://www.huzhan.com/ishop44358/ # Connection refused - connect(2)
https://www.huzhan.com/ishop44358/ -> websites/huzhan.com/ishop44358/index.html (305/112619)
https://www.huzhan.com/code/page/4 # Connection refused - connect(2)
https://www.huzhan.com/ishop7456/ # Connection refused - connect(2)

As a result, many of the downloaded files are empty. Has the archive website implemented controls to prevent crawling? I can access it normally in a browser, and I can also fetch the files with wget. Thank you for following this issue!
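
For anyone hitting the same log output: "Connection refused - connect(2)" is the message Ruby attaches to Errno::ECONNREFUSED, so one possible mitigation is to retry a refused request after a growing pause instead of writing an empty file. The following is only a minimal sketch under that assumption. `with_retries` and its parameters are hypothetical names, not part of wayback_machine_downloader, and whether retrying helps depends on why the connection was refused.

```ruby
# Hypothetical helper, not part of wayback_machine_downloader.
# Re-runs the block when the connection is refused, waiting longer each time.
# `sleeper` is injectable so the waiting can be replaced (for example in tests).
def with_retries(max_attempts: 4, base_delay: 2, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Errno::ECONNREFUSED
    # Give up and re-raise once the attempt budget is used.
    raise if attempt >= max_attempts

    # Exponential backoff: base_delay, then 2x, then 4x, ...
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A caller would wrap whatever performs the HTTP request in `with_retries { ... }`; the block receives the current attempt number.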

lkurz commented Nov 7, 2023

+1

@colinframe

It looks like they're rate limiting requests. I cracked open the gem and put a random 3-10s delay in WaybackMachineDownloader#download_file, and managed to get a full download of the site I was after. It took a good bit longer than usual, though.

danest commented Nov 9, 2023

> It looks like they're using rate limiting on requests, I cracked open the gem and put a random 3-10s delay in WaybackMachineDownloader#download_file and managed to get a full download of the site I was after, took a good bit longer than usual though.

Thanks, I just did this too and it's working again.

@YesYesTheDev

> It looks like they're using rate limiting on requests, I cracked open the gem and put a random 3-10s delay in WaybackMachineDownloader#download_file and managed to get a full download of the site I was after, took a good bit longer than usual though.

How can I go about doing this?

danest commented Nov 9, 2023

Find the location of your gem files using:

gem environment gemdir

Then cd into that directory and go into gems/wayback_machine_downloader-2.3.1.

Once you are in that directory, open the gem in VSCode or any other code editor.

Then I added this code at the start of the method:

  def download_file(file_remote_info)
    # Wait a random 3-10 seconds before each download to stay under the rate limit.
    random_delay = rand(3..10)
    sleep(random_delay)
    puts "Resumed after #{random_delay} seconds of delay"
    ...
    ...
  end
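
As an alternative to editing the installed gem file by hand, the same delay could in principle be layered on with Module#prepend. The sketch below is an illustration only: `RandomDelay`, `pause`, and `StandInDownloader` (with its made-up `:file_url` key) are hypothetical names so the example runs on its own, and it has not been tried against the real WaybackMachineDownloader class.

```ruby
# Mix-in that waits a random number of seconds before every download.
module RandomDelay
  DELAY_RANGE = (3..10) # seconds, the same range used in the edit above

  def download_file(file_remote_info)
    pause(rand(DELAY_RANGE))
    super # hand over to the original download_file
  end

  private

  # Kept separate so the waiting can be overridden (for example in tests).
  def pause(seconds)
    sleep(seconds)
  end
end

# Stand-in for the real class so this sketch is self-contained.
class StandInDownloader
  def download_file(file_remote_info)
    "downloaded #{file_remote_info[:file_url]}"
  end
end

# After this call, RandomDelay#download_file runs first and then calls the original.
StandInDownloader.prepend(RandomDelay)
```

The advantage of this shape is that the original method body stays untouched, so a gem update does not silently discard the change.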

@YesYesTheDev

Sorry to be a hassle, but there is a lib and a bin folder. Which one would I put it under, and where in the file would I put it? https://cdn2.noschool.work/u/bbxwZLFV6O8arNJPmarl.png

danest commented Nov 9, 2023

The file you want to edit is lib/wayback_machine_downloader.rb

@YesYesTheDev

> You want to edit the file here lib/wayback_machine_downloader.rb

It still isn't working. This is what it's showing: https://cdn2.noschool.work/u/zxPMiLtqgBV3oWvFrJFP.png

intercoop commented Nov 10, 2023

> It looks like they're using rate limiting on requests, I cracked open the gem and put a random 3-10s delay in WaybackMachineDownloader#download_file and managed to get a full download of the site I was after, took a good bit longer than usual though.

Thanks, but I don't have this problem with wget. Is there no rate limiting on wget requests?
For example, this command: wget -rc --recursive --no-clobber --page-requisites --convert-links --domains web.archive.org --no-parent --accept-regex '.https://www.youtube.com/.' https://web.archive.org/web/20220524143753/https://www.youtube.com/
It runs smoothly without any delay. Thank you for your kind reply.

@pattasrinivasvarma

I'm facing the same issue. Is there any solution?

@jcharaoui

Since 2019 they have been limiting requests to 15 per minute: https://archive.org/details/toomanyrequests_20191110

That is one request every 60 / 15 = 4 seconds, so adding a static 4-second delay works to avoid any connection refused errors.
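
One caveat worth sketching: if each worker thread sleeps independently, a run with --concurrency 3 (as in the original command) could still start up to three requests every four seconds. A limiter shared by all threads avoids that. The following is a minimal sketch, assuming the 15-per-minute figure above is accurate; `Throttle` and its parameters are hypothetical names, not part of the gem.

```ruby
# Hypothetical helper: spaces out calls so that, across all threads,
# at most one request starts per interval.
class Throttle
  attr_reader :interval

  # `clock` and `sleeper` are injectable so the timing can be simulated.
  def initialize(per_minute:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(s) { sleep(s) })
    @interval = 60.0 / per_minute # 15 per minute gives 4.0 seconds
    @clock = clock
    @sleeper = sleeper
    @mutex = Mutex.new
    @next_slot = nil
  end

  # Blocks the calling thread until the next free slot.
  def wait
    @mutex.synchronize do
      now = @clock.call
      @next_slot ||= now
      delay = @next_slot - now
      @sleeper.call(delay) if delay > 0
      # Reserve the following slot, never scheduling in the past.
      @next_slot = [@next_slot, now].max + @interval
    end
  end
end
```

Each download thread would call `throttle.wait` on one shared `Throttle.new(per_minute: 15)` instance right before making its request.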

nicholascc added a commit to nicholascc/wayback-machine-downloader that referenced this issue Nov 12, 2023
Currently a bit of a hack - there should probably be a configurable delay parameter.
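
To illustrate what a configurable delay parameter might look like, here is a small sketch using Ruby's standard OptionParser. The `--delay` flag and the `parse_delay` helper are hypothetical: nothing in this thread shows that the released gem or the referenced commit accepts such an option.

```ruby
require "optparse"

# Parses a hypothetical --delay option (NOT an existing flag of the gem).
# Returns an options hash with the delay in seconds.
def parse_delay(argv, default: 4.0)
  options = { delay: default }
  OptionParser.new do |opts|
    opts.on("--delay SECONDS", Float, "Seconds to wait before each download") do |value|
      options[:delay] = value
    end
  end.parse(argv) # non-destructive: argv itself is left unchanged
  options
end
```

The parsed value could then replace a hard-coded number passed to `sleep`, with the 4-second default matching the limit discussed above.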
ee3e mentioned this issue Dec 22, 2023