Switch to using JS WACZ #505

Draft
wants to merge 21 commits into
base: main

Conversation

ikreymer
Member

@ikreymer ikreymer commented Mar 22, 2024

  • Replaces the dependency on py-wacz with a native import of js-wacz.
  • Writes pages to either pages.jsonl (seed pages) or extraPages.jsonl (non-seed pages).
  • Uses streams for writing pages.
  • Replaces --generateCDX with simply moving tmp-cdx -> indexes.
  • Removes all dependencies on Python.
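
For illustration, the seed / non-seed split and the stream-based writing described above could look roughly like the sketch below. This is not the code from this PR: `PageEntry`, `pagesFileFor`, `toJsonlLine`, and `PageWriter` are hypothetical names, and the real page records likely carry more fields.

```typescript
import { createWriteStream, WriteStream } from "node:fs";
import * as path from "node:path";

// Hypothetical page record, for illustration only.
interface PageEntry {
  url: string;
  title?: string;
  seed: boolean;
}

// Seed pages go to pages.jsonl, all other pages to extraPages.jsonl.
function pagesFileFor(isSeed: boolean): string {
  return isSeed ? "pages.jsonl" : "extraPages.jsonl";
}

// JSONL: one JSON object per line, terminated by a newline.
function toJsonlLine(page: PageEntry): string {
  return JSON.stringify({ url: page.url, title: page.title }) + "\n";
}

// Keeps one append-mode write stream per output file, opened lazily,
// so pages are written out as they arrive instead of being buffered.
class PageWriter {
  private dir: string;
  private streams = new Map<string, WriteStream>();

  constructor(dir: string) {
    this.dir = dir;
  }

  write(page: PageEntry): void {
    const name = pagesFileFor(page.seed);
    let stream = this.streams.get(name);
    if (!stream) {
      stream = createWriteStream(path.join(this.dir, name), { flags: "a" });
      this.streams.set(name, stream);
    }
    stream.write(toJsonlLine(page));
  }

  close(): void {
    this.streams.forEach((stream) => stream.end());
  }
}
```

A caller would create one `PageWriter` for the pages directory, call `write()` per crawled page, and call `close()` once the crawl finishes.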

Fixes #484

Pending more testing and a js-wacz release; using @tw4l's branch for now!

@tw4l
Contributor

tw4l commented Mar 22, 2024

Also noticing that js-wacz logs plain strings to stdout, which breaks our logging format. Might want to see what we can do about that. I suppose if we call it as a subprocess via the CLI, we could capture its stdout and write it into the details of a crawler log line...
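
A minimal sketch of the subprocess idea, assuming a JSON log line with a `details` field: the `LogLine` shape and the `runAndWrap` helper are made up for illustration and are not the crawler's actual logger, and no js-wacz CLI arguments are shown because they are not given here.

```typescript
import { spawnSync } from "node:child_process";

// Assumed shape of a structured log line; field names are illustrative only.
interface LogLine {
  timestamp: string;
  logLevel: "info" | "error";
  context: string;
  message: string;
  details: { stdout: string[]; exitCode: number | null };
}

// Runs a command, captures its stdout instead of letting it reach our own
// stdout, and wraps the captured lines in a single structured log entry.
function runAndWrap(cmd: string, args: string[], context: string): LogLine {
  const result = spawnSync(cmd, args, { encoding: "utf8" });
  const lines = (result.stdout ?? "")
    .split("\n")
    .map((line) => line.trimEnd())
    .filter((line) => line.length > 0);
  return {
    timestamp: new Date().toISOString(),
    logLevel: result.status === 0 ? "info" : "error",
    context,
    message: "subprocess finished",
    details: { stdout: lines, exitCode: result.status },
  };
}
```

The returned object can then be serialized with `JSON.stringify` and emitted as one line, so free-form subprocess output never appears outside the JSON format.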

@tw4l
Contributor

tw4l commented Mar 22, 2024

TODO:

  • Add WACZ validation (not yet supported in js-wacz)
  • Make CDXJ handling more memory-efficient in js-wacz (currently keeps all pages in memory, may OOM with large crawls)
  • Possibly move CDXJ line handling in js-wacz from bin/cli.js into WACZ class
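
To make the memory concern in the second item concrete, here is one possible way to process a CDXJ file one line at a time rather than holding everything in memory. It is a sketch, not js-wacz code: `forEachLine` and `parseCdxjLine` are hypothetical helpers, and the line layout (`key timestamp {json}`) is assumed.

```typescript
import * as fs from "node:fs";
import { StringDecoder } from "node:string_decoder";

// Assumed CDXJ line layout: "<key> <timestamp> <JSON object>".
interface CdxjEntry {
  key: string;
  timestamp: string;
  data: Record<string, unknown>;
}

function parseCdxjLine(line: string): CdxjEntry {
  const first = line.indexOf(" ");
  const second = line.indexOf(" ", first + 1);
  if (first < 0 || second < 0) {
    throw new Error("not a CDXJ line: " + line);
  }
  return {
    key: line.slice(0, first),
    timestamp: line.slice(first + 1, second),
    data: JSON.parse(line.slice(second + 1)),
  };
}

// Reads the file in fixed-size chunks and invokes onLine per complete line,
// so memory use is bounded by the chunk size plus the longest line.
function forEachLine(
  file: string,
  onLine: (line: string) => void,
  chunkSize = 64 * 1024,
): void {
  const fd = fs.openSync(file, "r");
  const buf = Buffer.alloc(chunkSize);
  const decoder = new StringDecoder("utf8"); // handles split multi-byte chars
  let pending = "";
  try {
    let n = fs.readSync(fd, buf, 0, chunkSize, null);
    while (n > 0) {
      pending += decoder.write(buf.subarray(0, n));
      let idx = pending.indexOf("\n");
      while (idx >= 0) {
        onLine(pending.slice(0, idx));
        pending = pending.slice(idx + 1);
        idx = pending.indexOf("\n");
      }
      n = fs.readSync(fd, buf, 0, chunkSize, null);
    }
    pending += decoder.end();
    if (pending.length > 0) onLine(pending); // last line without newline
  } finally {
    fs.closeSync(fd);
  }
}
```

Each parsed entry can be handled and discarded inside the callback, which is the property that matters for large crawls.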

Development

Successfully merging this pull request may close these issues.

Use js-wacz to create WACZ files
2 participants