How do I add extractors to a pre-existing entry? (and other questions) #1332

Opening-Button-8988 · 2024-01-20T15:23:23Z

Let's say I only have singlefile and title downloaded. I want to download a pdf for the entry. In the snapshots page, it gives me the option to save a pdf but I have no idea what to specify under columns CMD, PWD (same as the others I guess?), CMD VERSION, OUTPUT, START TS, END TS, and STATUS. The CMD column is the most complicated; I have no clue what to put there to make it save pdfs. When I just leave those columns empty, it gives an error telling me that I need to fill them in ("Please correct the errors below").

I really wish clicking on the file icon, under Files Saved column in the webgui, would automatically download that file if it doesn't already exist. It doesn't make much sense that clicking on the icon takes you to a page that tells you it doesn't exist, and shows you this:

Snapshot <mysnapshot_id> exists in DB, but resource <mysnapshot_id>/output.pdf does not exist in snapshot dir yet.

Maybe this resource type is not availabe for this Snapshot,
or the archiving process has not completed yet?
# run this cmd to finish archiving this Snapshot
archivebox update -t timestamp <mysnapshot_id>

Firstly, it should explain that archivebox update -t timestamp <mysnapshot_id> won't work on your host, and list several ways of doing this such as docker compose run archivebox update -t timestamp <mysnapshot_id>. A command is of no use if it is incomplete.

I ran docker compose run archivebox update -t timestamp <mysnapshot_id> but it fetches all extractors. How do I only fetch using a specific extractor?

I've read the wiki, I don't believe there are any instructions on how to do this, but I may have missed it.

Also, by default I would like only singlefile and title downloaded using the URL List fetch method. Selecting these options every time I add URLs would be tiresome. Is there a way to modify the default behavior in the webgui?

Also I've noticed fetching takes a really long time. It takes about 30 seconds per bookmark, on average. I have gigabit speeds. A pdf will take ages, for example, while downloading that same pdf manually through my browser downloads it in an instant. Why does it take so long? And half my URLs don't fetch singlefile's at all. I can't figure out why. I'm able to view it in the browser.

pirate · 2024-01-23T23:19:04Z

pirate
Jan 23, 2024
Maintainer

To only fetch using specific extractors, e.g. pdf & singlefile, run:

docker compose run archivebox add --extract=pdf,singlefile 'https://example.com/some/new/url'

# or, to update an existing snapshot:

docker compose run archivebox update -t timestamp --extract=pdf,singlefile <mysnapshot_id>

# all of the CLI options are documented by our `--help` text on each subcommand, e.g.
docker compose run archivebox update --help
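If you need to do this for more than one existing snapshot, the update command above can be wrapped in a small loop. This is only a sketch: `update_snapshots` is a hypothetical helper name (not part of ArchiveBox), and it assumes the Docker Compose service is called `archivebox` as in the commands above.

```shell
# Sketch: re-run only the selected extractors on several existing snapshots.
# update_snapshots is a hypothetical helper, not an ArchiveBox command.
# Usage: update_snapshots EXTRACTORS TIMESTAMP [TIMESTAMP...]
update_snapshots() {
    extractors="$1"
    shift
    for ts in "$@"; do
        # Same update command as above, run once per snapshot timestamp.
        docker compose run archivebox update -t timestamp --extract="$extractors" "$ts" \
            || echo "update failed for snapshot $ts" >&2
    done
}
```

For example, `update_snapshots pdf,singlefile <mysnapshot_id> <another_snapshot_id>` would run the pdf and singlefile extractors on each of the listed snapshots in turn.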

> In the snapshots page, it gives me the option to save a pdf but I have no idea what to specify under columns CMD, PWD (same as the others I guess?), CMD VERSION, OUTPUT, START TS, END TS, and STATUS

The area you're looking at is an advanced way to manually add an ArchiveResult log entry. Adding a record there will not run that extractor; it simply creates a record as if the extractor had run. It's meant for manually editing extractor run logs to fix broken entries, or for adding a run entry for files you placed in the ./data/archive/<timestamp>/ folder by hand.

> I really wish clicking on the file icon, under Files Saved column in the webgui, would automatically download that file if it doesn't already exist.

This is not the desired behavior for most users. I would recommend using the Pull button in the UI instead, which will download all missing extractors for the selected snapshot.

> it should explain that archivebox update -t timestamp <mysnapshot_id> won't work on your host, and list several ways of doing this such as docker compose run archivebox update -t timestamp <mysnapshot_id>. A command is of no use if it is incomplete.

ArchiveBox can be run many different ways depending on how you have it installed, so it would make these pages very noisy to list every possible way to call archivebox on different systems. Please refer to our usage docs, which explain how to run archivebox [subcommand] commands on different setups (e.g. docker compose run archivebox [subcommand]): https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#cli-usage

Also note you can run archivebox on both the host and in Docker at the same time: simply pip install archivebox on the host, and then you can run one-off commands outside Docker without modification while still keeping the server running in Docker Compose.
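One way to avoid remembering which form applies on a given machine is a small shell wrapper. The following is a sketch only: `abx` is a hypothetical name, and it assumes the Compose service is named `archivebox` and that you call it from the directory where you would normally run these commands (the data directory on the host, or the folder containing your compose file).

```shell
# Sketch: use the host's archivebox CLI when it is installed,
# otherwise fall back to the Docker Compose service.
# abx is a hypothetical wrapper name, not part of ArchiveBox.
abx() {
    if command -v archivebox >/dev/null 2>&1; then
        # Host install found (e.g. via pip install archivebox).
        archivebox "$@"
    else
        # No host install: run the same subcommand through Docker Compose.
        docker compose run archivebox "$@"
    fi
}
```

With this in your shell profile, `abx update --help` would resolve to whichever installation is available.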

> Also, by default I would like only singlefile and title downloaded using URL List fetch method, selecting these options every time when adding URLs would be tiresome, is there a way to modify the default behavior in the webgui?

This is planned eventually, follow this issue for updates: #826

> Also I've noticed fetching takes a really long time. It takes about 30 seconds per bookmark, on average

That's normal and is required for ArchiveBox to not get blocked as bot traffic. Each method uses that time to spawn the necessary subprocesses for each extractor and to do preparation and cleanup afterward. If we hammer each URL too many times in too short a timeframe, we will get blocked or rate-limited by servers and ArchiveBox will be basically useless. ArchiveBox jobs can be sped up by splitting up the URLs and adding them in parallel jobs (e.g. archivebox add < urls1.txt & archivebox add < urls2.txt & ...), not by running multiple extractors on the same URL any faster.
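The parallel-jobs approach can be sketched as a small shell function. `add_in_parallel` is a hypothetical helper name, and the sketch assumes `archivebox` is callable directly on the host and that each argument is a text file with one URL per line. Under Docker Compose the URLs would need to be piped through to the container's stdin, which is not shown here.

```shell
# Sketch: archive several URL lists as parallel jobs.
# add_in_parallel is a hypothetical helper, not an ArchiveBox command.
# Usage: add_in_parallel urls1.txt urls2.txt ...
add_in_parallel() {
    for list in "$@"; do
        # One background `archivebox add` job per file, reading URLs from stdin.
        archivebox add < "$list" &
    done
    # Block until every background job has finished.
    wait
}
```

Each list still processes its own URLs one after another, so any single URL is not hit any faster; only the number of URLs in flight at once goes up.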

> And half my URLs don't fetch singlefile's at all. I can't figure out why. I'm able to view it in the browser.

Please open an issue and post the full output of docker compose run archivebox version + the output of docker compose run archivebox add --extract=singlefile 'https://example.com/some/url/thats/failing/for/you'.
