
Download fails #267

Open
ingvarr777 opened this issue Nov 12, 2023 · 7 comments · May be fixed by #268 or #280

Comments

@ingvarr777

Can't download anything lately.
Here's an example:

wayback_machine_downloader example.com
Downloading example.com to websites/example.com/ from Wayback Machine archives.

Getting snapshot pages................... found 25580 snapshots to consider.

5 files to download:
https://www.example.com/ # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/index.html was empty and was removed.
https://www.example.com/ -> websites/example.com/index.html (1/5)
http://www.example.com/? # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/?/index.html was empty and was removed.
http://www.example.com/? -> websites/example.com/?/index.html (2/5)
http://example.com/%2F/ # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com//index.html was empty and was removed.
http://example.com/%2F/ -> websites/example.com//index.html (3/5)
http://example.com/#main # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/#main/index.html was empty and was removed.
http://example.com/#main -> websites/example.com/#main/index.html (4/5)
http://example.com/#/login # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/#/login/index.html was empty and was removed.
http://example.com/#/login -> websites/example.com/#/login/index.html (5/5)

What I get as a result is a bunch of empty folders. Does anyone have a solution?


jomo06 commented Nov 14, 2023

Same here. I'm guessing that Wayback is breaking the connection after a small handful of requests; mine worked for the first 19 pages, then it began to fail.

sww1235 linked a pull request Nov 16, 2023 that will close this issue
@ingvarr777 (Author)

This fix does work. It's a bit slow now of course, but the files get downloaded.


sww1235 commented Nov 20, 2023

archive.org has implemented rate limiting, which is why the delay fixes things. It is unfortunate, and probably breaks multithreaded downloading as well, but it is a free resource after all. https://archive.org/details/toomanyrequests_20191110
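
For anyone patching locally in the meantime, here is a minimal sketch of the delay workaround (the helper name and the 4-second pause are my own assumptions; archive.org does not publish an exact threshold):

  require 'net/http'

  # Hypothetical helper: pause after every request so we stay under the
  # assumed rate limit instead of hammering web.archive.org.
  DELAY_BETWEEN_REQUESTS = 4 # seconds; raise this if refusals persist

  def fetch_with_delay(uri)
    response = Net::HTTP.get_response(uri)
    sleep DELAY_BETWEEN_REQUESTS
    response
  end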

@technomaz

Can we get this fix approved and a new release created?


ee3e commented Dec 22, 2023

As far as I can tell, archive.org is limiting the number of connections you can make in a short period of time.

As mentioned in #264, browsers and wget (which use persistent connections) are not affected by this issue.

It should be fixed by using a single persistent connection for all downloads instead of creating a new connection for each download.

diff --git a/lib/wayback_machine_downloader.rb b/lib/wayback_machine_downloader.rb
index 730714a..199b9dd 100644
--- a/lib/wayback_machine_downloader.rb
+++ b/lib/wayback_machine_downloader.rb
@@ -206,11 +206,15 @@ class WaybackMachineDownloader
     @processed_file_count = 0
     @threads_count = 1 unless @threads_count != 0
     @threads_count.times do
+      http = Net::HTTP.new("web.archive.org", 443)
+      http.use_ssl = true
+      http.start()
       threads << Thread.new do
         until file_queue.empty?
           file_remote_info = file_queue.pop(true) rescue nil
-          download_file(file_remote_info) if file_remote_info
+          download_file(file_remote_info, http) if file_remote_info
         end
+        http.finish()
       end
     end

@@ -243,7 +247,7 @@ class WaybackMachineDownloader
     end
   end

-  def download_file file_remote_info
+  def download_file (file_remote_info, http)
     current_encoding = "".encoding
     file_url = file_remote_info[:file_url].encode(current_encoding)
     file_id = file_remote_info[:file_id]
@@ -268,8 +272,8 @@ class WaybackMachineDownloader
         structure_dir_path dir_path
         open(file_path, "wb") do |file|
           begin
-            URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
-              file.write(uri.read)
+            http.get(URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}")) do |body|
+              file.write(body)
             end
           rescue OpenURI::HTTPError => e
             puts "#{file_url} # #{e}"


JXGA commented Jan 9, 2024

This is an elegant (and working) solution. Nice one!

n1zyy added a commit to n1zyy/wayback-machine-downloader that referenced this issue Feb 3, 2024
Connections are limited to 15/minute. More will lead to
a "Connection refused" error. Take ee3e's advice and just
use a persistent connection:
hartator#267 (comment)
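
If connections do have to be reopened, pacing them to the 15-per-minute figure above could look like the sketch below (illustrative only; archive.org does not document the exact limit):

  # Illustrative limiter: allow at most 15 connection openings per
  # rolling 60-second window, sleeping until a slot frees up.
  class ConnectionLimiter
    def initialize(max_per_minute = 15)
      @max = max_per_minute
      @timestamps = []
    end

    def wait_for_slot
      now = Time.now
      @timestamps.reject! { |t| now - t > 60 }
      sleep(60 - (now - @timestamps.first)) if @timestamps.size >= @max
      @timestamps << Time.now
    end
  end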
@ShiftaDeband

Thank you @ee3e!

Similarly, this should be implemented for get_all_snapshots_to_consider:

In wayback_machine_downloader.rb:

  def get_all_snapshots_to_consider
    # Note: Passing a page index parameter allows us to get more snapshots,
    # but from a less fresh index
    http = Net::HTTP.new("web.archive.org", 443)
    http.use_ssl = true
    http.start()
    print "Getting snapshot pages"
    snapshot_list_to_consider = []
    snapshot_list_to_consider += get_raw_list_from_api(@base_url, nil, http)
    print "."
    unless @exact_url
      @maximum_pages.times do |page_index|
        snapshot_list = get_raw_list_from_api(@base_url + '/*', page_index, http)
        break if snapshot_list.empty?
        snapshot_list_to_consider += snapshot_list
        print "."
      end
    end
    http.finish()
    puts " found #{snapshot_list_to_consider.length} snaphots to consider."
    puts
    snapshot_list_to_consider
  end

...and in archive_api.rb:

  def get_raw_list_from_api url, page_index, http
    request_url = URI("https://web.archive.org/cdx/search/xd")
    params = [["output", "json"], ["url", url]]
    params += parameters_for_api page_index
    request_url.query = URI.encode_www_form(params)

    begin
      json = JSON.parse(http.get(URI(request_url)).body)
      if (json[0] <=> ["timestamp","original"]) == 0
        json.shift
      end
      json
    rescue JSON::ParserError
      []
    end
  end

(Please check my code, but it worked for me to download a very large archive that I've been struggling with for a bit.)
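
For reference, a minimal way to exercise the modified get_raw_list_from_api directly, assuming the two methods above are loaded (the ensure block just guarantees the connection gets closed even on error):

  http = Net::HTTP.new("web.archive.org", 443)
  http.use_ssl = true
  http.start
  begin
    # First snapshot page for the bare URL, as in get_all_snapshots_to_consider
    snapshots = get_raw_list_from_api("example.com", nil, http)
    puts "got #{snapshots.length} snapshots"
  ensure
    http.finish if http.started?
  end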
