Crawl resumed from saved state revisits already done pages #491

Closed
ato opened this issue Mar 11, 2024 · 5 comments · Fixed by #495

ato commented Mar 11, 2024

If I run a crawl and then send the node.js crawl process a SIGINT, it writes a state YAML file into the crawls/ directory. The README states that:

The idea is that this crawl state YAML file can then be used as --config option to restart the crawl from where it was left off previously.

However, the state file only seems to contain the list of queued URLs and does not include any that are already done:

state:
  done: 2
  queued:
    - '{"added":"2024-03-11T05:36:42.324Z","url":"https://site.example/page3","seedId":0,"depth":1}'
    - '{"added":"2024-03-11T05:36:42.324Z","url":"https://site.example/page4","seedId":0,"depth":1}'
  pending: []
  failed: []
  errors: []

When passing a state file to the --config option, browsertrix-crawler seems to recrawl the entire site in a slightly different order. As far as I can tell, the second run doesn't know which pages were done in the first run, so it just queues them up again as soon as it encounters a link to them.

My expectation was that stopping and resuming from a state file should be roughly equivalent in terms of captured data to a crawl that was just never stopped.

Example:

$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/

In a different terminal, stop the crawl by sending SIGINT to the crawl process. (Pressing CTRL+C doesn't exit gracefully as it kills the browser; maybe that's a podman/docker difference.)

$ pkill -INT -f /bin/crawl

Resume from the state file:

$ cat collections/test/crawls/crawl-20240311062844-test.yaml
state:
  done: 3
  queued:
    - '{"added":"2024-03-11T06:28:34.511Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []
$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/ --config collections/test/crawls/crawl-20240311062844-test.yaml

Confirm that the same page was captured twice, once in each run:

$ grep '^org,meshy)/blog/outbackcdx-replication ' collections/test/indexes/index.cdxj
org,meshy)/blog/outbackcdx-replication 20240311062915 {"url": "https://www.meshy.org/blog/outbackcdx-replication/", "mime": "text/html", "status": "200", "digest": "sha256:3c91aad9db5f772528c32ffae302fc059d3e31de78c9235ce149d93bceac3c38", "length": "2155", "offset": "28193", "filename": "rec-b8254df4d6ed-20240311062902253-0.warc.gz"}
org,meshy)/blog/outbackcdx-replication 20240311062840 {"url": "https://www.meshy.org/blog/outbackcdx-replication/", "mime": "text/html", "status": "200", "digest": "sha256:3c91aad9db5f772528c32ffae302fc059d3e31de78c9235ce149d93bceac3c38", "length": "2160", "offset": "7558", "filename": "rec-dc8c7a02b0f6-20240311062832904-0.warc.gz"}
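
A quick way to find every page captured in both runs, rather than grepping one URL at a time, is to count captures per SURT key in the index. A minimal sketch (a hypothetical helper, not part of browsertrix-crawler; the index path matches the example above):

#!/usr/bin/python3
# List SURT keys that appear more than once in a CDXJ index, e.g.
#   ./check_dupes.py collections/test/indexes/index.cdxj
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        # CDXJ lines look like "<surt key> <timestamp> <json>"; count by key only.
        counts[line.split(" ", 1)[0]] += 1

for key, n in sorted(counts.items()):
    if n > 1:
        print(n, key)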

ato commented Mar 11, 2024

It seems the state parser will accept a list of URLs in the done section instead of a count. So the following workaround seems to allow resuming a crawl without revisiting already done pages.

We just take the done URLs from pages.jsonl and add them to the done: section in the state file, like so:

state:
  done: 
    - '{"url":"https://www.meshy.org/"}'
    - '{"url":"https://www.meshy.org/blog/oracle-unicode/"}'
  queued:
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/outbackcdx-replication/","seedId":0,"depth":1,"extraHops":0}'
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []

Example script:

#!/usr/bin/python3
import json, sys, yaml

if len(sys.argv) < 3:
    sys.exit("Usage: workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml")

pages_file = sys.argv[1]
state_file = sys.argv[2]

# Collect the URLs of pages that were already crawled from pages.jsonl.
urls = set()
with open(pages_file) as f:
    f.readline()  # skip the header line
    for line in f:
        urls.add(json.loads(line)['url'])

# Replace the numeric 'done' count in the saved state with a list of done URLs
# (as JSON strings, matching the format of the 'queued' entries), and write the
# patched state YAML to stdout.
with open(state_file) as f:
    data = yaml.safe_load(f)
data['state']['done'] = [json.dumps({'url': url}) for url in urls]
yaml.safe_dump(data, stream=sys.stdout, default_flow_style=False)
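
Running the script and resuming with the patched state then looks roughly like this (paths follow the earlier example; pages.jsonl is assumed to be under collections/test/pages/, and the patched state file is written into the mounted crawls directory so the container can read it):

$ ./workaround.py collections/test/pages/pages.jsonl collections/test/crawls/crawl-20240311062844-test.yaml > crawl-state-done.yaml
$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/ --config crawl-state-done.yaml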

tw4l assigned tw4l and ikreymer and unassigned tw4l on Mar 13, 2024
ikreymer added a commit that referenced this issue Mar 15, 2024
- ensure seen urls that were done still added to 'doneUrls' list, fixes #491
- ensure extraSeeds added from redirects also added to redis and serialized

ikreymer (Member) commented:

Thanks for reporting; indeed, this was an oversight in a previous refactor. The done array was no longer being kept in order to save memory, but of course the set of successfully finished / done / seen URLs still needs to be kept to avoid recrawling previously crawled URLs.

#495 fixes this by recomputing the finished list of page URLs (taking the seen set and subtracting the queued and failed URLs).
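
In other words, roughly this set arithmetic (a Python illustration of the idea only; the crawler itself is TypeScript and the names here are made up):

#!/usr/bin/python3
# Illustration of recomputing the finished list from the other sets.
def compute_finished(seen, queued, failed):
    # Anything seen that is no longer queued and did not fail must have finished.
    return seen - queued - failed

seen = {"https://site.example/", "https://site.example/page2", "https://site.example/page3"}
queued = {"https://site.example/page3"}
failed = set()
print(sorted(compute_finished(seen, queued, failed)))
# ['https://site.example/', 'https://site.example/page2']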

ikreymer (Member) commented:

@ato is it OK for this to be in the 1.0.0 release? Which version are you using?

ato commented Mar 15, 2024

We're on 0.12.4, but I already implemented the workaround I described above and it's working well enough for now. So yes, if the fix is in 1.0.0 that's fine; I'll just delete the workaround when we upgrade. :-)

It's really nice that it adds extra seeds on redirects too. That's actually something I'd been wondering how we'd handle when switching more of our crawls over from Heritrix.

ikreymer (Member) commented:

@ato great! If you have a chance to test 1.0.0, we would welcome additional feedback! 1.0.0 uses CDP entirely for capture and includes various other fixes (it should generally work better!)

ikreymer added a commit that referenced this issue Mar 16, 2024
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491