
Modify --pages option to copy pages files directly into WACZ #92

Merged
merged 6 commits into from Mar 22, 2024

Conversation

tw4l
Collaborator

@tw4l tw4l commented Mar 20, 2024

Fixes #91

Adds tests as well. Happy to make any changes you see fit. Thanks for the review!

pages.jsonl and extraPages.jsonl files will be copied, other files
are ignored.
@matteocargnelutti
Collaborator

Hi @tw4l -- thanks for the PR!

I'm on board with the idea 🏄

Here are two things I would suggest:

  • --pages-dir should be a replacement of --pages, so we don't have two features behaving in almost identical ways. I'd suggest replacing the current --pages option with your feature.
  • I would add some data validation in copyPagesFilesToZip() so we don't accidentally copy over an invalid pages.jsonl / extraPages.jsonl file. For example, I'd want to be "sure" it's a series of JSON objects, and that they match the spec.

What do you think?

@tw4l
Collaborator Author

tw4l commented Mar 20, 2024

> Hi @tw4l -- thanks for the PR!
>
> I'm on board with the idea 🏄
>
> Here are two things I would suggest:
>
> * `--pages-dir` should be a replacement of `--pages`, so we don't have two features behaving in almost identical ways. I'd suggest replacing the current `--pages` option with your feature.
>
> * I would add some data validation in `copyPagesFilesToZip()` so we don't accidentally copy over an invalid `pages.jsonl` / `extraPages.jsonl` file. For example, I'd want to be "sure" it's a series of JSON objects, and that they match the spec.
>
> What do you think?

Those both make sense to me! Happy to push these changes tomorrow morning :)

- Replace existing -p/--pages implementation rather than adding
another option
- Rather than hardcoding allowed names, check that JSONL files
passed have correct extension and are well-formed JSON lines
- Modify tests and fixtures to account for new logic
@tw4l
Collaborator Author

tw4l commented Mar 21, 2024

Good morning @matteocargnelutti !

I've gone ahead and made the changes, as well as checking for a .jsonl extension (case-insensitive) rather than checking for pages.jsonl and extraPages.jsonl specifically, to allow for future flexibility with pages filenames.

Let me know if anything else needs changing!
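The case-insensitive extension check described above might look roughly like this (a minimal sketch; `isJsonlFile` is a hypothetical helper name, not necessarily what the PR uses):

```javascript
// Hypothetical helper: accept any file whose extension is .jsonl,
// regardless of case, rather than hardcoding pages.jsonl / extraPages.jsonl.
function isJsonlFile (filename) {
  return filename.toLowerCase().endsWith('.jsonl')
}
```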

@tw4l tw4l changed the title Add --pagesDir option to copy pages files directly into WACZ Modify --pages option to copy pages files directly into WACZ Mar 21, 2024
index.js Outdated
```javascript
const rl = readline.createInterface({ input: createReadStream(pagesFile) })
for await (const line of rl) {
  try {
    JSON.parse(line)
    // … (excerpt continues)
```
Collaborator

Thanks for the edits @tw4l!

I think we're almost there - maybe we should add a little bit of validation against the spec, just to make sure those are indeed pages files.

We could check:

  • That the first item contains format and id properties
  • That subsequent entries contain url and ts properties

Maybe using Node's assert, since we're already in a try / catch?

What do you think?

Collaborator Author

Yeah nice suggestion! May as well get this right while we're focused on it :)

Collaborator Author

Commit pushed!

Collaborator Author

@tw4l tw4l Mar 21, 2024

Ilya raised a related point: in older versions of the crawler, the pages files occasionally included invalid lines, we think because of text extraction output that wasn't truncated.

It may be safer, if less performant, to filter per-line rather than per-file, and write the valid lines as-is into the correct file in the WACZ. What do you think?

Collaborator

I'm now inclined to say that we can generate a valid pages.jsonl and just fail if it's invalid. Another option would be, on first failure, to skip the failing lines and reserialize only the valid ones (similar to the old behavior), but I'm not sure that's needed at this point.

Collaborator

I think either approach works for me; in both cases, we are unlikely to end up with invalid pages.jsonl files added to the archive. I am also not concerned about performance for this step.

Let me know if you'd like to add line-by-line filtering or not 😄. Otherwise: this is great and I'm happy to test / approve / merge.

Thank you both!

Collaborator Author

Sounds like we're going to stick with per-file and just make sure the crawler isn't writing any invalid pages files to begin with, so feel free to test/approve/merge, thank you!

Co-authored-by: Matteo Cargnelutti <matteo.cargnelutti@gmail.com>
@matteocargnelutti matteocargnelutti merged commit ec3bac4 into harvard-lil:main Mar 22, 2024
2 checks passed
@matteocargnelutti
Collaborator

Thanks again for a great PR, @tw4l

@tw4l tw4l deleted the issue-91-copy-pages-files branch March 22, 2024 20:49
Successfully merging this pull request may close these issues.

Add option to copy existing pages.jsonl/extraPages.jsonl files directly in WACZ