DO NOT USE unless you have a means of rate limiting yourself #281

Open
jdimpson opened this issue Mar 7, 2024 · 8 comments


jdimpson commented Mar 7, 2024

The Wayback Machine is (rightfully) blocking bulk downloads that use too much bandwidth or make too many requests per second. As far as I can tell, this product does no rate limiting of its own, at least not by default and not in any of the README examples. As a result, the Internet Archive will soft-ban your IP address if you use this script on a website of any significant size.

It's irresponsible to leave this repository up without at least a warning in the documentation.
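
To make the idea of self rate limiting concrete, here is a minimal sketch of a limiter that enforces a minimum gap between successive requests. It is not code from wayback_machine_downloader or from any fork: the class name `MinIntervalLimiter`, the `throttle` method, and the interval value are all hypothetical, and the "fetch" is a stand-in string rather than a real HTTP call. No official request limit is stated in this thread, so any real interval would have to be chosen conservatively.

```ruby
# Hypothetical helper, NOT part of wayback_machine_downloader:
# enforces a minimum interval between successive calls.
class MinIntervalLimiter
  def initialize(min_interval_seconds)
    @min_interval = min_interval_seconds
    @last_call = nil
  end

  # Sleeps just long enough that consecutive calls start at least
  # @min_interval seconds apart, then runs the given block.
  def throttle
    if @last_call
      remaining = @min_interval - (monotonic_now - @last_call)
      sleep(remaining) if remaining > 0
    end
    @last_call = monotonic_now
    yield
  end

  private

  # Monotonic clock, so wall-clock adjustments do not affect the spacing.
  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Demo: a tiny interval so the sketch finishes quickly. A real run against
# the Wayback Machine would need a much larger, conservatively chosen value.
limiter = MinIntervalLimiter.new(0.05)
urls = %w[a.htm b.htm c.htm] # placeholder names, not real snapshot URLs

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
fetched = urls.map { |u| limiter.throttle { "fetched #{u}" } }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts fetched.inspect
puts format("elapsed: %.3f s", elapsed)
```

With three items and one limiter, the second and third calls each wait for the interval, so the total run takes roughly two intervals. A downloader that fetches files concurrently would need to share one such limiter across its threads (with a mutex), which this sketch does not attempt.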


tinyapps commented Mar 7, 2024

See ShiftaDeband's fork (which contains the fixes mentioned in his PR) as well as issues #273 and #275.

@Elmagenta

> See ShiftaDeband's fork (which contains the fixes mentioned in his PR) as well as issues #273 and #275.

Sorry to bother, I'm pretty new to this. How can I actually use this fork instead of the master branch?


tinyapps commented Apr 6, 2024

@Elmagenta: You'll need to have Ruby installed. Then you can just download ShiftaDeband's fork as a ZIP file, unzip it, and run wayback_machine_downloader, which you'll find in the bin subdirectory.


flag-br commented Apr 7, 2024

@tinyapps I'm also pretty new to this, and I couldn't follow your instructions. I have Ruby installed, and I had also installed the "original" wayback_machine_downloader via the macOS Terminal. Following your instructions, I downloaded the ZIP file and simply tried to run the binary file, but I get this error message:

/Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/bin/wayback_machine_downloader:3:in `require_relative': cannot load such file -- /Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/lib/wayback_machine_downloader (LoadError)
from /Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/bin/wayback_machine_downloader:3:in `<main>'

Could you give more details on how to proceed?


tinyapps commented Apr 7, 2024

@flag-br: Sounds like you might've deleted (or not extracted) the included lib directory or its contents. After unzipping wayback-machine-downloader-feature-httpGet.zip, just cd into the bin subdirectory and run wayback_machine_downloader without deleting any of the other included files or folders. The directory structure should look like this:

.
├── Dockerfile
├── Gemfile
├── MIT-LICENSE.txt
├── README.md
├── Rakefile
├── bin
│   └── wayback_machine_downloader
├── lib
│   ├── wayback_machine_downloader
│   │   ├── archive_api.rb
│   │   ├── tidy_bytes.rb
│   │   └── to_regex.rb
│   └── wayback_machine_downloader.rb
├── test
│   └── test_wayback_machine_downloader.rb
└── wayback_machine_downloader.gemspec
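
The LoadError quoted earlier in this thread comes from line 3 of bin/wayback_machine_downloader, where `require_relative` resolves a path relative to the script's own location, so the lib directory has to sit next to bin as in the tree above. The following is a rough sketch of that lookup, assuming the script requires `../lib/wayback_machine_downloader` (inferred from the error message, not checked against the fork's source). The helpers `expected_lib_file` and `layout_ok?` are hypothetical names introduced only for illustration.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: given the path to bin/wayback_machine_downloader,
# return the library file that a
#   require_relative "../lib/wayback_machine_downloader"
# in that script would need to find (assumed from the LoadError text).
def expected_lib_file(bin_script)
  File.expand_path("../lib/wayback_machine_downloader.rb", File.dirname(bin_script))
end

# True when the library file exists where the bin script would look for it.
def layout_ok?(bin_script)
  File.file?(expected_lib_file(bin_script))
end

# Demo in a throwaway directory: first with lib/ missing, then with it present.
results = Dir.mktmpdir do |root|
  bin = File.join(root, "bin", "wayback_machine_downloader")
  FileUtils.mkdir_p(File.dirname(bin))
  FileUtils.touch(bin)
  without_lib = layout_ok?(bin)

  FileUtils.mkdir_p(File.join(root, "lib"))
  FileUtils.touch(File.join(root, "lib", "wayback_machine_downloader.rb"))
  with_lib = layout_ok?(bin)

  [without_lib, with_lib]
end

puts "lib missing: #{results[0]}, lib present: #{results[1]}"
```

Pointing `layout_ok?` at the real extracted bin/wayback_machine_downloader path is a quick way to confirm that the lib directory was unpacked alongside it before running the script.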


flag-br commented Apr 8, 2024

@tinyapps Thank you very much, it worked! It ran normally, but the result is practically the same as what I was getting before with the master branch version. The folder structure was apparently reproduced correctly on my machine, but only 15 htm files were downloaded. To check, I ran wayback_machine_downloader with the --list option, which reports 1116 htm files.

The command I'm using is (after cd to bin folder): wayback_machine_downloader https://jazzdiscogcorner.pagesperso-orange.fr/

This site is quite simple, just text and practically no images.

Am I doing something wrong?


tinyapps commented Apr 8, 2024

@flag-br: Glad to hear it worked out. As for issues with a specific site, I'd recommend checking out the documentation and searching through the open and closed issues before posting a new issue.


eggplantedd commented Apr 29, 2024

> @flag-br: Sounds like you might've deleted (or not extracted) the included lib directory or its contents. After unzipping wayback-machine-downloader-feature-httpGet.zip, just cd into the bin subdirectory and run wayback_machine_downloader without deleting any of the other included files or folders. The directory structure should look like this:
>
> .
> ├── Dockerfile
> ├── Gemfile
> ├── MIT-LICENSE.txt
> ├── README.md
> ├── Rakefile
> ├── bin
> │   └── wayback_machine_downloader
> ├── lib
> │   ├── wayback_machine_downloader
> │   │   ├── archive_api.rb
> │   │   ├── tidy_bytes.rb
> │   │   └── to_regex.rb
> │   └── wayback_machine_downloader.rb
> ├── test
> │   └── test_wayback_machine_downloader.rb
> └── wayback_machine_downloader.gemspec

I'm being stupid here, but trying to run wayback_machine_downloader (type - file) in the bin directory gave me "not recognized as an internal or external command, operable program or batch file". Fresh Ruby install.

I had to run gem build wayback_machine_downloader.gemspec, then gem install the wayback_machine_downloader-2.3.2.gem file that was generated, and only then could I run wayback_machine_downloader from cmd successfully. Any advice on what I was doing wrong?
