How do I add extractors to a pre-existing entry? (and other questions) #1332
Replies: 1 comment
To only fetch using specific extractors, e.g. pdf & singlefile, run:
docker compose run archivebox add --extract=pdf,singlefile 'https://example.com/some/new/url'
# or, to update an existing snapshot:
docker compose run archivebox update -t timestamp --extract=pdf,singlefile <mysnapshot_id>
# all of the CLI options are documented by our `--help` text on each subcommand, e.g.
docker compose run archivebox update --help
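If several existing snapshots need the same extractor subset, the update command above can be wrapped in a small loop. This is only a sketch, assuming a Docker Compose setup like the one above: the function name `update_snapshots` and the `DRY_RUN` variable are hypothetical helpers, not part of ArchiveBox, and the snapshot IDs in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ArchiveBox): re-run only the chosen
# extractors for each snapshot ID given on the command line.
# Usage: update_snapshots <comma,separated,extractors> <snapshot_id>...
update_snapshots() {
    local extractors
    local id
    extractors="$1"
    shift
    for id in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # With DRY_RUN=1 the command is only printed, not executed.
            echo docker compose run archivebox update -t timestamp "--extract=$extractors" "$id"
        else
            docker compose run archivebox update -t timestamp "--extract=$extractors" "$id"
        fi
    done
}
```

Running `DRY_RUN=1 update_snapshots pdf,singlefile <id1> <id2>` first prints the commands without executing them, which is a cheap way to check the IDs before anything is fetched.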
The area you're looking at is an advanced way to manually add an ArchiveResult log entry. Adding a record there will not run that extractor; it simply creates a record as if the extractor had run. It's for manually editing extractor run logs to fix broken entries, or to add a run entry manually for files you put in the
This is not the desired behavior for most users. I would recommend using the
ArchiveBox can be run many different ways depending on how you have it installed, so listing every possible way to call archivebox on different systems would make these pages very noisy. Please refer to our usage docs here, which explain how to run Also note you can run archivebox on both the host and in Docker at the same time, simply
This is planned eventually, follow this issue for updates: #826
That's normal, and it is required for ArchiveBox to not get blocked as bot traffic. Each method uses that time to spawn the necessary subprocesses for each extractor and to do preparation and cleanup afterwards. If we hammer each URL too many times in too short a timeframe, we will get blocked or rate-limited by servers and ArchiveBox will be basically useless. ArchiveBox jobs can be sped up by splitting up and adding many URLs in parallel jobs (e.g.
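One possible way to split a URL list into parallel jobs is sketched below. It assumes a Docker Compose setup and a plain text file with one URL per line. The function name `add_in_parallel`, the `DRY_RUN` variable, and the chunk naming are hypothetical, not part of ArchiveBox. Keep the number of chunks small so the rate-limiting concern above still holds.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ArchiveBox): split a URL list into
# chunks and add each chunk in its own background job.
# Usage: add_in_parallel <url_file> <lines_per_chunk>
add_in_parallel() {
    local url_file
    local chunk_size
    local tmpdir
    local chunk
    url_file="$1"
    chunk_size="$2"
    tmpdir="$(mktemp -d)"
    # split writes files named chunk_aa, chunk_ab, ... into tmpdir
    split -l "$chunk_size" "$url_file" "$tmpdir/chunk_"
    for chunk in "$tmpdir"/chunk_*; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # With DRY_RUN=1 the command is only printed, not executed.
            echo "docker compose run -T archivebox add < $chunk"
        else
            # -T disables TTY allocation so the URL list can be read from stdin
            docker compose run -T archivebox add < "$chunk" &
        fi
    done
    wait    # block until every background job has finished
    rm -rf "$tmpdir"
}
```

For example, `add_in_parallel urls.txt 50` would start one job per 50 URLs, where `urls.txt` is a placeholder file name.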
Please open an issue and post the full output of
Let's say I only have `singlefile` and `title` downloaded. I want to download a pdf for the entry. In the snapshots page, it gives me the option to save a pdf, but I have no idea what to specify under columns `CMD`, `PWD` (same as the others I guess?), `CMD VERSION`, `OUTPUT`, `START TS`, `END TS`, and `STATUS`. The `CMD` column is the most complicated; I have no clue what to put there to make it save pdfs. When I just keep those columns empty it gives an error telling me that I need to fill in those columns ("Please correct the errors below").
I really wish clicking on the file icon, under the Files Saved column in the webgui, would automatically download that file if it doesn't already exist. It doesn't make much sense that clicking on the icon takes you to a page that tells you it doesn't exist, and shows you this:
Firstly, it should explain that `archivebox update -t timestamp <mysnapshot_id>` won't work on your host, and list several ways of doing this such as `docker compose run archivebox update -t timestamp <mysnapshot_id>`. A command is of no use if it is incomplete.
I ran `docker compose run archivebox update -t timestamp <mysnapshot_id>` but it fetches all extractors. How do I only fetch using a specific extractor?
I've read the wiki, I don't believe there are any instructions on how to do this, but I may have missed it.
Also, by default I would like only `singlefile` and `title` downloaded using the URL List fetch method. Selecting these options every time when adding URLs would be tiresome; is there a way to modify the default behavior in the webgui?
Also, I've noticed fetching takes a really long time. It takes about 30 seconds per bookmark, on average. I have gigabit speeds. A pdf will take ages, for example, while downloading that same pdf manually through my browser is instant. Why does it take so long? And half my URLs don't fetch singlefiles at all. I can't figure out why. I'm able to view them in the browser.