Crawl resumed from saved state revisits already done pages #491
Comments
It seems the state parser will accept a list of URLs in the 'done' list. As a workaround, we just take the done URLs from pages.jsonl and add them into the state:

```yaml
done:
  - '{"url":"https://www.meshy.org/"}'
  - '{"url":"https://www.meshy.org/blog/oracle-unicode/"}'
queued:
  - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/outbackcdx-replication/","seedId":0,"depth":1,"extraHops":0}'
  - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
pending: []
failed: []
errors: []
```

Example script:

```python
#!/usr/bin/python3
import json, sys, yaml

if len(sys.argv) < 3:
    sys.exit("Usage: workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml")

pages_file = sys.argv[1]
state_file = sys.argv[2]

# Collect the URLs of all pages already captured in this crawl.
urls = set()
with open(pages_file) as f:
    f.readline()  # skip the header line
    for line in f:
        urls.add(json.loads(line)['url'])

# Load the saved crawl state and fill in the missing 'done' list.
with open(state_file) as f:
    data = yaml.safe_load(f)

data['state']['done'] = [json.dumps({'url': url}) for url in urls]

yaml.safe_dump(data, default_flow_style=False, stream=sys.stdout)
```
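To apply the workaround, run something like `./workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml` (mirroring the script's usage string), then resume by passing the patched crawl-state-done.yaml to the crawler's --config option.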
- ensure seen urls that were done still added to 'doneUrls' list, fixes #491
- ensure extraSeeds added from redirects also added to redis and serialized
Thanks for reporting, indeed this was an oversight in a previous refactor. The done array was no longer being kept, to save memory, but of course the successfully finished / done / seen set needs to be kept to avoid recrawling previous URLs. #495 fixes this by recomputing the finished list of page URLs (taking the seen set and subtracting the queued and failed URLs).
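As a rough illustration of that recomputation (plain Python sets standing in for the crawler's Redis-backed state; the URLs are just for the example):

```python
# Illustrative only: recompute the finished list as the seen set
# minus the queued and failed URLs, as the fix describes.
seen = {
    "https://www.meshy.org/",
    "https://www.meshy.org/blog/oracle-unicode/",
    "https://www.meshy.org/blog/outbackcdx-replication/",
}
queued = {"https://www.meshy.org/blog/outbackcdx-replication/"}
failed = set()

finished = sorted(seen - queued - failed)
print(finished)
# ['https://www.meshy.org/', 'https://www.meshy.org/blog/oracle-unicode/']
```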
@ato is it ok for this to be in the 1.0.0 release? Which version are you using?
We're on 0.12.4, but I already implemented the workaround I described above and that's working well enough for now. So yeah, if the fix is in 1.0.0 that's fine; I'll just delete the workaround when we upgrade. :-) Really nice that it adds extra seeds on redirects too. That's actually something I was wondering how we'd handle when switching more of our crawls over from Heritrix.
@ato great! If you have a chance to test 1.0.0, we would welcome additional feedback! In 1.0.0, we use CDP entirely for capture, and it includes various other fixes (generally it should work better!)
- Fixes state serialization, which was missing the done list. Instead, adds a 'finished' list computed from the seen list, minus failed and queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added dynamically from a redirect (via #475). Extra seeds are added to Redis and also included in the serialization.

Fixes #491
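For the extraSeeds part, a hypothetical sketch of the idea (the function and field names are assumptions, not the crawler's actual API): when a seed redirects, the redirect target is recorded as an extra seed so the resumed crawl still treats it as in-scope.

```python
import json

# Hypothetical sketch: record redirect targets of seeds as 'extraSeeds'
# so they survive serialization; names here are illustrative only.
extra_seeds: list[str] = []

def add_extra_seed(orig_seed_id: int, new_url: str) -> None:
    entry = json.dumps({"origSeedId": orig_seed_id, "newUrl": new_url})
    if entry not in extra_seeds:
        # in the real crawler this would also be pushed to Redis
        extra_seeds.append(entry)

add_extra_seed(0, "https://www.meshy.org/")
print(extra_seeds)
```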
If I run a crawl and then send the node.js crawl process a SIGINT, it writes a state YAML file into the crawls/ directory. The README states that this state file can be passed back via the --config option to resume the crawl. However, the state file only seems to contain the list of queued URLs and does not include any that are already done.
When passing a state file to the --config option, browsertrix-crawler seems to recrawl the entire site in a slightly different order. As far as I can tell, the second run doesn't know which pages were done in the first run, so it just queues them up again as soon as it encounters a link to them.
My expectation was that stopping and resuming from a state file should be roughly equivalent in terms of captured data to a crawl that was just never stopped.
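As a toy model of the failure mode (names are assumptions, not the crawler's real code): if the seen set is rebuilt on resume only from the queued URLs, links to already-captured pages pass the novelty check and get queued again.

```python
# Toy model of the bug: on resume, 'seen' is rebuilt only from the
# saved queue, which no longer records the done URLs.
seen: set[str] = set()
queue = ["https://www.meshy.org/blog/outbackcdx-replication/"]  # from saved state
seen.update(queue)

def maybe_queue(url: str) -> None:
    if url not in seen:  # done pages are missing from 'seen'
        seen.add(url)
        queue.append(url)

maybe_queue("https://www.meshy.org/")  # captured in run 1, but queued again
print(queue[-1])  # https://www.meshy.org/
```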
Example:
In a different terminal, stop the crawl by sending SIGINT to the crawl process. (Pressing CTRL+C doesn't exit gracefully, as it kills the browser. Maybe that's a podman/docker difference.)
Resume from the state file.
Confirm that the same page was captured twice, once in each run:
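One way to confirm, as a rough sketch (the collection paths below are placeholders): count URL occurrences across both runs' pages.jsonl files and look for counts above one.

```python
import json
from collections import Counter

# Count how often each URL appears across both runs' page lists.
counts = Counter()
for path in [
    "crawls/collections/run1/pages/pages.jsonl",  # placeholder paths
    "crawls/collections/run2/pages/pages.jsonl",
]:
    with open(path) as f:
        f.readline()  # skip the header line
        for line in f:
            counts[json.loads(line)["url"]] += 1

print({url: n for url, n in counts.items() if n > 1})
```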