Crawl resumed from saved state revisits already done pages #491

Closed
ato opened this issue Mar 11, 2024 · 5 comments · Fixed by #495

ato commented Mar 11, 2024

If I run a crawl and then send the node.js crawl process a SIGINT, it writes a state YAML file into the crawls/ directory. The README states that:

The idea is that this crawl state YAML file can then be used as --config option to restart the crawl from where it was left off previously.

However, the state file only seems to contain the list of queued URLs and does not include any that are already done:

state:
  done: 2
  queued:
    - '{"added":"2024-03-11T05:36:42.324Z","url":"https://site.example/page3","seedId":0,"depth":1}'
    - '{"added":"2024-03-11T05:36:42.324Z","url":"https://site.example/page4","seedId":0,"depth":1}'
  pending: []
  failed: []
  errors: []

When passing a state file to the --config option, browsertrix-crawler seems to recrawl the entire site in a slightly different order. As far as I can tell, the second run doesn't know which pages were done in the first run, so it just queues them up again as soon as it encounters a link to them.

My expectation was that stopping and resuming from a state file should be roughly equivalent in terms of captured data to a crawl that was just never stopped.

Example:

$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/

In a different terminal, stop the crawl by sending SIGINT to the crawl process. (Pressing CTRL+C doesn't exit gracefully as it kills the browser; maybe that's a podman/docker difference.)

$ pkill -INT -f /bin/crawl

Resume from the state file:

$ cat collections/test/crawls/crawl-20240311062844-test.yaml
state:
  done: 3
  queued:
    - '{"added":"2024-03-11T06:28:34.511Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []
$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/ --config collections/test/crawls/crawl-20240311062844-test.yaml

Confirm that the same page was captured twice, once in each run:

$ grep '^org,meshy)/blog/outbackcdx-replication ' collections/test/indexes/index.cdxj
org,meshy)/blog/outbackcdx-replication 20240311062915 {"url": "https://www.meshy.org/blog/outbackcdx-replication/", "mime": "text/html", "status": "200", "digest": "sha256:3c91aad9db5f772528c32ffae302fc059d3e31de78c9235ce149d93bceac3c38", "length": "2155", "offset": "28193", "filename": "rec-b8254df4d6ed-20240311062902253-0.warc.gz"}
org,meshy)/blog/outbackcdx-replication 20240311062840 {"url": "https://www.meshy.org/blog/outbackcdx-replication/", "mime": "text/html", "status": "200", "digest": "sha256:3c91aad9db5f772528c32ffae302fc059d3e31de78c9235ce149d93bceac3c38", "length": "2160", "offset": "7558", "filename": "rec-dc8c7a02b0f6-20240311062832904-0.warc.gz"}
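
A quick way to find every page captured in both runs, rather than grepping one URL at a time, is to count captures per SURT key in the index. A minimal sketch (a hypothetical helper, not part of browsertrix-crawler; the index path matches the example above):

#!/usr/bin/python3
# List SURT keys that appear more than once in a CDXJ index, e.g.
#   ./check_dupes.py collections/test/indexes/index.cdxj
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        # CDXJ lines look like "<surt key> <timestamp> <json>"; count by key only.
        counts[line.split(" ", 1)[0]] += 1

for key, n in sorted(counts.items()):
    if n > 1:
        print(n, key)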

ato commented Mar 11, 2024

It seems the state parser will accept a list of URLs in the done section instead of a count. So the following workaround seems to allow resuming a crawl without revisiting already done pages.

We just take the done URLs from pages.jsonl and add them to the done: section in the state file, like so:

state:
  done: 
    - '{"url":"https://www.meshy.org/"}'
    - '{"url":"https://www.meshy.org/blog/oracle-unicode/"}'
  queued:
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/outbackcdx-replication/","seedId":0,"depth":1,"extraHops":0}'
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []

Example script:

#!/usr/bin/python3
import json, sys, yaml

if len(sys.argv) < 3:
    sys.exit("Usage: workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml")

pages_file = sys.argv[1]
state_file = sys.argv[2]

# Collect the URLs of pages that were already crawled from pages.jsonl.
urls = set()
with open(pages_file) as f:
    f.readline()  # skip the header line
    for line in f:
        urls.add(json.loads(line)['url'])

# Replace the numeric 'done' count in the saved state with a list of done URLs
# (as JSON strings, matching the format of the 'queued' entries), and write the
# patched state YAML to stdout.
with open(state_file) as f:
    data = yaml.safe_load(f)
data['state']['done'] = [json.dumps({'url': url}) for url in urls]
yaml.safe_dump(data, stream=sys.stdout, default_flow_style=False)
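
Running the script and resuming with the patched state then looks roughly like this (paths follow the earlier example; pages.jsonl is assumed to be under collections/test/pages/, and the patched state file is written into the mounted crawls directory so the container can read it):

$ ./workaround.py collections/test/pages/pages.jsonl collections/test/crawls/crawl-20240311062844-test.yaml > crawl-state-done.yaml
$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/ --config crawl-state-done.yaml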

tw4l assigned tw4l and ikreymer and unassigned tw4l on Mar 13, 2024
ikreymer added a commit that referenced this issue Mar 15, 2024
- ensure seen urls that were done still added to 'doneUrls' list, fixes #491
- ensure extraSeeds added from redirects also added to redis and serialized

ikreymer (Member) commented:

Thanks for reporting; indeed, this was an oversight in a previous refactor. The done array was no longer being kept in order to save memory, but of course the set of successfully finished / done / seen URLs still needs to be kept to avoid recrawling previously crawled URLs.

#495 fixes this by recomputing the finished list of page URLs (taking the seen set and subtracting the queued and failed URLs).
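
In other words, roughly this set arithmetic (a Python illustration of the idea only; the crawler itself is TypeScript and the names here are made up):

#!/usr/bin/python3
# Illustration of recomputing the finished list from the other sets.
def compute_finished(seen, queued, failed):
    # Anything seen that is no longer queued and did not fail must have finished.
    return seen - queued - failed

seen = {"https://site.example/", "https://site.example/page2", "https://site.example/page3"}
queued = {"https://site.example/page3"}
failed = set()
print(sorted(compute_finished(seen, queued, failed)))
# ['https://site.example/', 'https://site.example/page2']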

ikreymer (Member) commented:

@ato is it OK for this to be in the 1.0.0 release? Which version are you using?

ato commented Mar 15, 2024

We're on 0.12.4, but I already implemented the workaround I described above and it's working well enough for now. So yes, if the fix is in 1.0.0 that's fine; I'll just delete the workaround when we upgrade. :-)

It's really nice that it adds extra seeds on redirects too. That's actually something I'd been wondering how we'd handle when switching more of our crawls over from Heritrix.

ikreymer (Member) commented:

@ato great! If you have a chance to test 1.0.0, we would welcome additional feedback! 1.0.0 uses CDP entirely for capture and includes various other fixes (it should generally work better!)

ikreymer added a commit that referenced this issue Mar 16, 2024
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491