Use js-wacz to create WACZ files #484

Open
tw4l opened this issue Mar 5, 2024 · 1 comment · May be fixed by #505

tw4l commented Mar 5, 2024

Improvements for 1.0.0 branch of crawler:

  • Switch from using py-wacz to js-wacz for WACZ generation
  • Pass in indexes from /tmp-cdx rather than reindexing from the WARCs
  • Support creating indices with --generateCDX from temp-cdx/ rather than having to reindex from the WARCs
  • Delete /tmp-cdx once it is no longer needed
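A rough sketch of the "reuse existing indexes" idea from the list above, assuming the temporary directory holds one line-oriented CDXJ file per WARC. The function names (`mergeCdxjLines`, `mergeCdxjDir`) and the file extensions are hypothetical; this is not the crawler's or js-wacz's actual API, only an illustration of merging already-written index lines instead of re-reading the WARCs.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: combine the lines of several CDXJ documents
// into one list sorted by plain string order, skipping blank lines.
function mergeCdxjLines(documents: string[]): string[] {
  const lines: string[] = [];
  for (const doc of documents) {
    for (const line of doc.split("\n")) {
      const trimmed = line.trim();
      if (trimmed.length > 0) {
        lines.push(trimmed);
      }
    }
  }
  // Default sort compares UTF-16 code units, not locale order.
  return lines.sort();
}

// Hypothetical helper: read every index file in a directory
// (the extensions checked here are an assumption) and merge them.
function mergeCdxjDir(dir: string): string[] {
  const docs = readdirSync(dir)
    .filter((name) => name.endsWith(".cdxj") || name.endsWith(".cdx"))
    .map((name) => readFileSync(join(dir, name), "utf8"));
  return mergeCdxjLines(docs);
}
```

The merge never opens a WARC, which is the point of the proposal: the per-WARC indexes written during the crawl are the only input.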

tw4l commented Mar 7, 2024

Related js-wacz PR: harvard-lil/js-wacz#89

ikreymer added a commit that referenced this issue Mar 26, 2024
Previously, there was the main WARCWriter as well as a utility
WARCResourceWriter, which was used for screenshots, text, and pageinfo
and generated only resource records. This separate WARC writing path did
not generate CDX; it used appendFile() to append new WARC records to an
existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is
done through a single WARCWriter, which uses a writable stream to append
records and can also generate CDX on the fly. This change is a
prerequisite for the js-wacz conversion (#484), since all WARCs need to
have generated CDX.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
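
To illustrate why a single stream-based writer makes on-the-fly indexing straightforward, here is a minimal sketch, not the crawler's actual WARCWriter. It assumes each record arrives as an already-serialized byte buffer; the class name `OffsetTrackingWriter` and the `IndexEntry` shape are hypothetical. Because every record passes through one writer, the writer always knows the byte offset at which the next record starts, and that offset is what an index entry needs.

```typescript
import { Writable } from "node:stream";

// Hypothetical index entry: where a record lives inside the output file.
interface IndexEntry {
  url: string;
  offset: number;
  length: number;
}

// Hypothetical writer: appends records to one stream and records
// each record's offset and length as it goes.
class OffsetTrackingWriter {
  private out: Writable;
  private offset = 0;
  readonly index: IndexEntry[] = [];

  constructor(out: Writable) {
    this.out = out;
  }

  writeRecord(url: string, record: Uint8Array): IndexEntry {
    const entry: IndexEntry = {
      url,
      offset: this.offset,
      length: record.byteLength,
    };
    // Backpressure is ignored here for brevity; a real writer would
    // wait for "drain" when write() returns false.
    this.out.write(record);
    this.offset += record.byteLength;
    this.index.push(entry);
    return entry;
  }
}
```

A second code path that appends to the same file outside this writer would leave the tracked offset stale, which is consistent with the commit's motivation for removing the separate resource writer.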
Status: Implementing