UserAgent in parameters #170

11 changes: 11 additions & 0 deletions README.md
@@ -47,6 +47,7 @@ It will download the last version of every file present on Wayback Machine to `.
-p, --maximum-snapshot NUMBER Maximum snapshot pages to consider (Default is 100)
Count an average of 150,000 snapshots per page
-l, --list Only list file urls in a JSON format with the archived timestamps, won't download anything
+-u, --user-agent STRING UserAgent for connection (Default is Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0)

## Specify directory to save files to

@@ -175,6 +176,16 @@ Example:

wayback_machine_downloader http://example.com --concurrency 20

+## Specify UserAgent for connection
+
+-u, --user-agent STRING
+
+UserAgent for connection (Default is Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0)
+
+Example:
+
+wayback_machine_downloader http://example.com --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
+
## Using the Docker image

As an alternative installation way, we have a Docker image! Retrieve the wayback-machine-downloader Docker image this way:
4 changes: 4 additions & 0 deletions bin/wayback_machine_downloader
@@ -58,6 +58,10 @@ option_parser = OptionParser.new do |opts|
options[:list] = true
end

+opts.on("-u", "--user-agent STRING", String, "UserAgent for connection (Default is Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0)") do |t|
+options[:user_agent] = t
+end
+
opts.on("-v", "--version", "Display version") do |t|
options[:version] = t
end
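
The option handling in this hunk can be exercised in isolation. The following is a minimal sketch, not the PR's code: `parse_user_agent_options`, `effective_user_agent`, and `DEFAULT_USER_AGENT` are hypothetical names introduced here for illustration, and the default string is copied from the diff.

```ruby
require "optparse"

# Fallback value copied from the diff; the constant name is an assumption.
DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"

# Hypothetical helper (not part of the PR): parse an argv-style array and
# return the options hash. :user_agent stays unset when the flag is absent.
def parse_user_agent_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on("-u", "--user-agent STRING", String, "User-Agent header for connections") do |value|
      options[:user_agent] = value
    end
  end
  # parse (without the bang) leaves the caller's array untouched.
  parser.parse(argv)
  options
end

# Hypothetical helper: apply the same kind of fallback the initializer uses.
def effective_user_agent(options)
  options[:user_agent] || DEFAULT_USER_AGENT
end
```

Calling `effective_user_agent(parse_user_agent_options(ARGV))` would yield the string passed with `-u` or `--user-agent`, or the default when neither is given.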
5 changes: 3 additions & 2 deletions lib/wayback_machine_downloader.rb
@@ -18,7 +18,7 @@ class WaybackMachineDownloader

attr_accessor :base_url, :exact_url, :directory, :all_timestamps,
:from_timestamp, :to_timestamp, :only_filter, :exclude_filter,
-:all, :maximum_pages, :threads_count
+:all, :maximum_pages, :threads_count, :user_agent

def initialize params
@base_url = params[:base_url]
@@ -32,6 +32,7 @@ def initialize params
@all = params[:all]
@maximum_pages = params[:maximum_pages] ? params[:maximum_pages].to_i : 100
@threads_count = params[:threads_count].to_i
+@user_agent = params[:user_agent] ? params[:user_agent] : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
end

def backup_name
@@ -268,7 +269,7 @@ def download_file file_remote_info
structure_dir_path dir_path
open(file_path, "wb") do |file|
begin
-URI.open("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}", "Accept-Encoding" => "plain") do |uri|
+open("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}", "Accept-Encoding" => "plain", "User-Agent" => @user_agent) do |uri|
file.write(uri.read)
end
rescue OpenURI::HTTPError => e
2 changes: 1 addition & 1 deletion lib/wayback_machine_downloader/archive_api.rb
@@ -5,7 +5,7 @@ def get_raw_list_from_api url, page_index
request_url += url
request_url += parameters_for_api page_index

-URI.open(request_url).read
+open(request_url, "User-Agent" => @user_agent).read
end

def parameters_for_api page_index
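
Both request sites in this PR pass the configured value as a `User-Agent` header through open-uri. The sketch below shows one way that could look, under stated assumptions: `request_headers` and `fetch_with_user_agent` are hypothetical helpers, not code from the PR. It calls `URI.open` because open-uri stopped handling URLs through `Kernel#open` in Ruby 3.0, so the bare `open(...)` form is only expected to work on older Rubies.

```ruby
require "open-uri"

# Hypothetical helper (not part of the PR): build the header hash sent with
# each request. String keys are treated by open-uri as HTTP header fields.
def request_headers(user_agent)
  {
    "Accept-Encoding" => "plain", # same value the diff already sends
    "User-Agent" => user_agent
  }
end

# Hypothetical helper: fetch a URL with the headers above and return the body.
# Performs a real network request when called.
def fetch_with_user_agent(url, user_agent)
  URI.open(url, request_headers(user_agent)) { |io| io.read }
end
```

Keeping the header hash in one place would mean the download path and the API listing path cannot drift apart on which `User-Agent` they send.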