Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Save/Load State #495

Merged
merged 6 commits into from Mar 16, 2024
Merged

Fix Save/Load State #495

merged 6 commits into from Mar 16, 2024

Conversation

ikreymer
Copy link
Member

  • Fixes state serialization, which was missing the done list. Instead, adds a 'finished' list computed from the seen list, minus failed and queued URLs.
  • Also adds serialization support for 'extraSeeds', seeds added dynamically from a redirect (via Handle seed redirects #475). Extra seeds are added to Redis and also included in the serialization.

Fixes #491

- ensure seen urls that were done still added to 'doneUrls' list, fixes #491
- ensure extraSeeds added from redirects also added to redis and serialized
…ed list

don't serialize 'done' key
serialize 'extraSeeds' to save state and read back on load
update saved-state test to include a redirect seed to test extraSeeds
@ikreymer ikreymer requested a review from tw4l March 15, 2024 02:23
avoid double init of extraSeeds, only load directly if not loading state
Copy link
Contributor

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, tested locally and working well. Thanks for the tests and cleanup here too.

tests/saved-state.test.js Outdated Show resolved Hide resolved
@tw4l
Copy link
Contributor

tw4l commented Mar 15, 2024

I guess only other question is, could we not go back to using done rather than introducing a new finished, since we're already handling it as an int or array or strings for backwards compatibility's sake? No strong opinion here, but it seems we could just undo the breaking change rather than renaming.

@ikreymer
Copy link
Member Author

I guess only other question is, could we not go back to using done rather than introducing a new finished, since we're already handling it as an int or array or strings for backwards compatibility's sake? No strong opinion here, but it seems we could just undo the breaking change rather than renaming.

Yeah, I thought about that, technically could populate done from the pages list (it is supposed to be a full json dict rather than a single URL), but that requires either 1) reading and parsing pages.jsonl, and in case there's an issue (though there shouldn't be, have seen it if text is added) or 2) re-adding the done key in redis which would only be used for this.
But, that previous format still required that done be a JSON string but really we just need a single URL so this is more efficient, but perhaps later can switch to reading directly from pages.jsonl as well.

So overall, this is a better format, and chose a new key name to avoid confusion.

@ikreymer ikreymer merged commit 6d04c95 into main Mar 16, 2024
4 checks passed
@ikreymer ikreymer deleted the fix-save-load-state branch March 16, 2024 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crawl resumed from saved state revisits already done pages
2 participants