Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify WARC writing + CDXJ indexing into single class #507

Merged
merged 5 commits into from Mar 26, 2024

Conversation

ikreymer
Copy link
Member

Previously, we had the main WARCWriter as well as utility WARCResourceWriter that was used for screenshots, text, pageinfo and only generated resource records. This separate WARC writing path did not generate CDX, and used appendFile() to append new WARC records to an existing WARC.

This PR refactors into a single WARCWriter, which uses a writable stream to append records, and can also generate CDX on the fly. This PR is a pre-requisite to the js-wacz conversion (#484) since all WARCs need to have generated CDX.

However, as this is a big enough change, it made sense to refactor into its own PR for now.

unify warc-writing into single WARCWriter class to support cdx indexing for all records
create dedicated writers for screenshots and text
@ikreymer ikreymer requested a review from tw4l March 23, 2024 04:58
Copy link
Contributor

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and working well, just a few comments/questions

src/util/warcwriter.ts Show resolved Hide resolved
src/crawler.ts Show resolved Hide resolved
src/crawler.ts Show resolved Hide resolved
src/crawler.ts Outdated Show resolved Hide resolved
src/util/worker.ts Show resolved Hide resolved
ikreymer and others added 2 commits March 26, 2024 14:23
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
@ikreymer ikreymer merged commit 0ad10a8 into main Mar 26, 2024
4 checks passed
@ikreymer ikreymer deleted the unify-warc-writer branch March 26, 2024 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants