New Spec: Nested WACZ files? #129

Open
4 tasks
ikreymer opened this issue Oct 31, 2022 · 3 comments
@ikreymer
Member

Related to multi-WACZ / aggregated WACZ loading #112, a possible idea is to support nested WACZ files, e.g. ZIP files containing other WACZ files and a datapackage.json.
The main use case for this would be parallel crawlers that produce multiple WACZ files, each signed individually. For packaging / distribution, it is still convenient to bundle the output into a single file. This makes sense if the reason for having multiple WACZ outputs is parallelism, and not size limits. Some questions to answer around this:

  • Should this be supported?
  • What metadata is needed at the top level: just datapackage.json and a list of WACZ files?
  • Should arbitrarily deep WACZ nesting be supported, or only one level?
  • What is the signing scheme for a nested WACZ, e.g. a datapackage-digest.json that signs the WACZ of WACZs?
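To make the questions above concrete, here is a minimal sketch of what a one-level nested bundle could look like. Nothing here is specified anywhere: the function name `build_nested_wacz`, the `nested-wacz-package` profile value, and the flat layout are all assumptions for illustration. The `name` / `path` / `hash` / `bytes` resource fields and the `datapackage-digest.json` shape are modeled loosely on how a regular WACZ describes its contents. Signing of the outer digest is left out entirely.

```python
import hashlib
import io
import json
import zipfile


def build_nested_wacz(inner_waczs: dict[str, bytes]) -> bytes:
    """Bundle already-built WACZ files into one outer ZIP (one nesting level).

    Hypothetical layout, for discussion only: inner WACZ files at the top
    level, plus datapackage.json and datapackage-digest.json.
    """
    resources = []
    buf = io.BytesIO()
    # ZIP_STORED: the inner WACZ files are already compressed ZIPs.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as outer:
        for name, data in sorted(inner_waczs.items()):
            # Inner files are copied byte-for-byte, so any signature that
            # covers an inner WACZ is unaffected by the bundling.
            outer.writestr(name, data)
            resources.append({
                "name": name,
                "path": name,
                "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            })

        # Assumed profile value; not defined by any existing spec.
        datapackage = json.dumps(
            {"profile": "nested-wacz-package", "resources": resources},
            indent=2,
        ).encode("utf-8")
        outer.writestr("datapackage.json", datapackage)

        # The outer digest covers only the outer datapackage.json, which in
        # turn lists the hash of every inner WACZ.
        digest = {
            "path": "datapackage.json",
            "hash": "sha256:" + hashlib.sha256(datapackage).hexdigest(),
        }
        outer.writestr("datapackage-digest.json", json.dumps(digest, indent=2))
    return buf.getvalue()
```

In this sketch the outer digest transitively covers the inner files through their hashes, which is one possible answer to the signing question, not a settled one.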

An alternative would be to simply merge the WACZ files (merging the CDXJ, page lists, etc.), which is also doable, but more work, both to implement and to run.
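For a sense of what the index part of a merge involves, here is a rough sketch that combines several CDXJ indexes into one. It assumes each input is already sorted, one record per line, and that plain string comparison matches the index's sort order (true for ASCII keys). The helper name `merge_cdxj` is made up for this example. It covers only the CDXJ; a real merge would also need to combine page lists, copy the WARCs, and handle WARC filename collisions.

```python
import heapq
from typing import Iterable, Iterator


def _clean(lines: Iterable[str]) -> Iterator[str]:
    """Strip trailing newlines and skip blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line


def merge_cdxj(*indexes: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted CDXJ line streams into one sorted stream.

    Exact duplicate lines are emitted once. Inputs are consumed lazily,
    so this works on open file handles without loading them into memory.
    """
    previous = None
    for line in heapq.merge(*(_clean(idx) for idx in indexes)):
        if line != previous:
            yield line
        previous = line
```

Usage could look like `merged = list(merge_cdxj(open("a.cdxj"), open("b.cdxj")))`, with the file names being placeholders.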

@ikreymer
Member Author

Clarifying a bit more, there are two key reasons for ending up with multiple WACZ files:
A) A parallel crawl with multiple crawlers, each producing its own WACZ file containing a subset of the pages.
B) A single crawl that reaches a certain data size limit, where adding to one file is no longer desirable (e.g. perhaps over 100GB?)

The possible solutions are as follows:

  1. Combine smaller WACZ files into a single one by merging the .cdxj files and creating a new WACZ file with all the WARCs.
  2. Combine smaller WACZ files into a new 'nested' WACZ, as described above.
  3. Create a JSON manifest of multiple WACZ files, as discussed in WACZ Aggregation / Multi WACZ Specification #112.

Options 1) and 2) are good solutions for reason A, where multiple WACZ files exist due to parallel crawling and can be quite small.
However, option 3) may be the best option for reason B, where multiple WACZ files exist because each one is already quite large.
We will probably need the JSON manifest from 3) and either 1) or 2) as well, unless we decide to support only the JSON manifest.
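As a rough illustration of option 3), the sketch below writes a standalone manifest for WACZ files that stay separate on disk. The manifest shape (a bare `resources` list with `name`, `path`, `hash`, `bytes`) and the function name `build_manifest` are assumptions for this example, not the format discussed in #112. Files are hashed in chunks, since in case B each WACZ may be very large.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(wacz_paths: list[Path]) -> str:
    """Return a JSON manifest describing separate WACZ files.

    Hypothetical format: the WACZ files are referenced, never copied or
    rewritten, so very large files are only read once for hashing.
    """
    resources = []
    for path in sorted(wacz_paths):
        sha = hashlib.sha256()
        with path.open("rb") as fh:
            # Read 1 MiB at a time to keep memory use flat.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                sha.update(chunk)
        resources.append({
            "name": path.name,
            "path": path.name,  # assumed: relative to the manifest location
            "hash": "sha256:" + sha.hexdigest(),
            "bytes": path.stat().st_size,
        })
    return json.dumps({"resources": resources}, indent=2)
```

Unlike options 1) and 2), nothing is repackaged here, which is why this shape seems better suited to files that are already large.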

@edsu
Collaborator

edsu commented Jan 9, 2023

@ikreymer for nesting, would we need a new file name and extension for nested WACZ files that is distinct from WACZ? If not, won't WACZ viewers need to account for whether the WACZ was nested or not and behave accordingly?

If we want to consider nesting as part of WACZ I think this would mean updating the WACZ specification to include this nesting functionality directly, or at least pointing to the separate WACZ Aggregation specification?

@edsu
Collaborator

edsu commented Jan 9, 2023

In the use case above where each WACZ is individually signed, is the issue that the cert that is being used to sign each WACZ needs to be different? Or is it simply a technical convenience to get around CDXJ merging? Or are there other issues at play?
