Encapsulation, access control, and fixity #14

Open
jcahill opened this issue Aug 14, 2020 · 5 comments

Comments

@jcahill
Contributor

jcahill commented Aug 14, 2020

One issue that seems to arise from the current draft spec is a loss of separation of concerns with respect to accessing and modifying components of the collection. This speaks to both (a) the need for different parties to have differing levels of access to distinct materials and (b) the need to be confident that the underlying capture data has not changed and is being given a wide berth.

Some scenarios that come to mind:

  1. Haphazard in-place modification of the record leads to container integrity issues.
  2. Access controls require deriving a wacz_new from a subset of wacz_orig.
  3. Certain groups are only ever interested in select subsets of the data. They need it in bulk, so they need raw download. But everything else is dead weight.
  4. Updating of wacz containers obscures a fixity issue with the records.
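Scenarios 1 and 4 both come down to detecting whether the capture data changed after the fact. A minimal sketch of one way to check that, assuming a plain directory layout and SHA-256 digests (the function names and the manifest shape are illustrative, not part of the wacz draft):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large WARCs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List paths modified, added, or removed since the manifest was taken."""
    current = build_manifest(root)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

Taking the manifest before a container is updated and comparing afterwards would surface an in-place modification that a repack might otherwise hide.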

Some of these issues could be solvable with some scoping of when exactly the encapsulation is expected to occur in relation to content changes. If the wacz spec is to be seen more as a sort of collection layout convention than an archive file format, compression could itself remain optional, only needing to come into play as a storage-mode consideration, i.e. when collections aren't in a state of heavy development. BagIt's evolution comes to mind. Wikipedia:

> Until version 15, the draft also described how to serialize a bag in an archive file, such as ZIP or TAR. From version 15 on, the serialization is no longer part of the specifications, but not because of technical reasons but only because of the scope and focus of the specification.

The outer zip container is effectively a glorified suitcase for the data and metadata here (wacz draft), so it stands to reason that it might not always be strictly necessary. The hierarchy's hammering down of certain conventions for pairing of web archival data files and their sidecar metadata files strikes me as much more important.

The most important question for me, then, lies in how to effectively reason about contents already in wacz hierarchies, especially for the purposes of aggregating and disaggregating them.
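As a rough illustration of the disaggregation case (deriving a wacz_new from a subset of wacz_orig without touching the original), here is a sketch using Python's standard `zipfile` module. The `derive_subset` helper and its `keep` predicate are hypothetical names; the sketch copies members as-is and does not regenerate any index or metadata files that the container might need to stay consistent.

```python
import zipfile
from typing import BinaryIO, Callable


def derive_subset(
    src: BinaryIO, dst: BinaryIO, keep: Callable[[str], bool]
) -> list[str]:
    """Copy members for which keep(name) is true from src ZIP into a new dst ZIP.

    src is opened read-only, so the original container is never modified.
    """
    copied = []
    with zipfile.ZipFile(src, "r") as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            if info.is_dir() or not keep(info.filename):
                continue
            # Reads the whole member into memory; fine for a sketch,
            # but large WARCs would call for streaming instead.
            data = zin.read(info.filename)
            # Keep each member's original compression method.
            zout.writestr(info.filename, data, compress_type=info.compress_type)
            copied.append(info.filename)
    return copied
```

Aggregation would be the reverse loop over several source containers, with the added problem of name collisions, which this sketch does not address.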

@ato

ato commented Aug 15, 2020

While standardizing the hierarchy by itself may be interesting for other use cases, the details of the encapsulation are essential to achieving the two goals that motivated the creation of WACZ. It needs to be a single file so it can be shared easily, and that single file needs to be constructed carefully, not as just any generic container format, in order to allow incremental loading without downloading/reading the entire collection.
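To make the incremental-loading point concrete, here is a small sketch showing that reading one member from a ZIP touches only the central directory and that member, not the whole file. It uses an in-memory file that counts the bytes read; `CountingReader`, `build_container`, `read_one`, and the member names are illustrative assumptions, and a remote reader would presumably translate the same seeks and reads into ranged requests.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory file that tallies how many bytes are actually read from it."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_container(members: dict[str, bytes]) -> bytes:
    """Write members into a ZIP without compression (ZIP_STORED)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def read_one(container: bytes, name: str) -> tuple[bytes, int]:
    """Return one member's bytes plus the number of container bytes touched."""
    src = CountingReader(container)
    with zipfile.ZipFile(src, "r") as zf:
        payload = zf.read(name)
    return payload, src.bytes_read


big = b"x" * 1_000_000
container = build_container(
    {"archive/data.warc": big, "pages/pages.jsonl": b'{"url": "https://example.com/"}\n'}
)
payload, touched = read_one(container, "pages/pages.jsonl")
print(f"read {touched} of {len(container)} bytes to fetch {len(payload)} bytes")
```

The small member is fetched without reading the large one, which is the property a generic stream-oriented container such as TAR does not offer without a separate index.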

@ato

ato commented Aug 15, 2020

Ah, I think I misunderstood you. You're just saying you'd like to see versioning and fixity as features and suggesting that BagIt or OCFL could be added as structural layers to provide those features. I think I was confused because you mentioned how BagIt eliminated the specification about ZIP but for WACZ the details around ZIP are actually essential to achieving its goals and so cannot be eliminated.

Edit: I confused atomotic and jcahill as the same person. My bad!

@ikreymer
Member

> other container formats satisfy this need like bagit or the newest ocfl.
>
> so why not keep wacz format as simple as possible and relative only to the webarchiving domain and organize collections of wacz inside ocfl?

My impression is that OCFL is designed specifically around the need to store multiple versions of data and their digests.
But that doesn't apply to WARC files, since there's never going to be a 'v2' of the same WARC file.

I suppose using BagIt may be a better fit, but that wouldn't address the random-access requirement, for which the ZIP bundling is still necessary.

@ikreymer
Member

Maybe there should be a separate WAC directory layout, and the Z part for packing it up as a single ZIP file.

But are users going to open the expanded file, or just use it as a sort of black box, e.g. the way .docx files generally are?

I suppose maybe that could be useful if a collection is being actively edited, though it's not designed as an edit-in-place format.
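A sketch of what that split might look like in practice: a plain directory layout that is edited in place, plus a separate packing step that writes it out as a single ZIP with stored (uncompressed) entries. The `pack` and `unpack` helpers are hypothetical and use only Python's standard `zipfile` module; nothing here is prescribed by the draft.

```python
import zipfile
from pathlib import Path


def pack(root: Path, dest: Path) -> list[str]:
    """Pack every file under root into dest as stored (uncompressed) entries.

    dest should live outside root, otherwise the archive would include itself.
    """
    names = []
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(root).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


def unpack(src: Path, root: Path) -> None:
    """Expand a packed container back into a plain directory layout."""
    with zipfile.ZipFile(src, "r") as zf:
        zf.extractall(root)
```

Under this split, editing would happen on the expanded directory and `pack` would run only when the collection is ready to be shared or stored.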

@atomotic

Sorry, I have a precarious connection on the train and I mistakenly deleted my previous comment.

Got the point, the OCFL design is not useful here. BagIt, instead, could be zipped uncompressed.
The bagit package (Go) in https://github.com/ndlib/bendo does this, as an example.
