Add option to use existing CDXJ rather than indexing from WARCs #89

Merged
2 commits merged on Mar 7, 2024

tw4l commented Mar 7, 2024

Fixes #88

All tests and linting are passing, please let me know if you'd like to see any changes!

Since skipping WARC indexing removes one of the ways pages are detected, the new --cdxj option must be used in combination with --pages. I've added a validator that fails early if this is not the case.
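
For illustration, the fail-early check described above could look roughly like the sketch below. This is not the code from the PR: the function name `validateCdxjOptions` and the shape of the `options` object are assumptions made for the example, and the actual option parsing in bin/cli.js may differ.

```javascript
// Hypothetical helper: `cdxj` and `pages` mirror the --cdxj and --pages options.
// The real validator in bin/cli.js may be structured differently.
function validateCdxjOptions (options) {
  const hasCdxj = Boolean(options.cdxj)
  const hasPages = Boolean(options.pages)

  // Without WARC indexing, pages cannot be detected from the records,
  // so a pages source has to be provided explicitly.
  if (hasCdxj && !hasPages) {
    throw new Error('--cdxj must be used in combination with --pages')
  }
}

// Fail early, before any file is read.
try {
  validateCdxjOptions({ cdxj: './indexes' })
} catch (err) {
  console.error(err.message)
}
```

Running the check before any I/O means a misconfigured invocation stops immediately instead of producing an archive with no pages.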

bin/cli.js Outdated

const ext = cdxjFile.split('.').pop()
if (!allowedExts.includes(ext)) {
log.info(`CDXJ: Skipping file ${cdxjFile}, not a CDXJ file`)

@tw4l Nitpick: I'd make that a warning maybe?

tw4l replied

Good call, just pushed the change :)
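
As a rough sketch of what this thread settles on, the extension check with a warning instead of an info message might look like the following. The contents of `allowedExts`, the helper name `filterCdxjFiles`, and the `logger` parameter are assumptions for the example; the snippet under review only shows that `allowedExts` and `log` exist.

```javascript
// Assumption: the accepted extensions. The reviewed snippet only shows
// that a list named `allowedExts` exists, not what it contains.
const allowedExts = ['cdx', 'cdxj']

// Hypothetical helper. `logger` defaults to console and stands in for
// the CLI's `log` object.
function filterCdxjFiles (files, logger = console) {
  const kept = []
  for (const cdxjFile of files) {
    const ext = cdxjFile.split('.').pop()
    if (!allowedExts.includes(ext)) {
      // Warning rather than info, as suggested in the review.
      logger.warn(`CDXJ: Skipping file ${cdxjFile}, not a CDXJ file`)
      continue
    }
    kept.push(cdxjFile)
  }
  return kept
}
```

A warning makes skipped files visible at default log levels, which helps when a directory contains unexpected files.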

matteocargnelutti left a comment

This is perfect @tw4l - Thank you very much for a great PR.

I can merge and publish whenever you're ready :)

tw4l commented Mar 7, 2024

> This is perfect @tw4l - Thank you very much for a great PR.
>
> I can merge and publish whenever you're ready :)

Thanks so much @matteocargnelutti ! Should be ready now :)

matteocargnelutti merged commit 3201423 into harvard-lil:main on Mar 7, 2024
2 checks passed
Successfully merging this pull request may close these issues.

Add option to use existing CDXJ indices rather than indexing from WARCs