Multiple extension-fields of the same type on the same record? #95

Open
acidus99 opened this issue Jan 8, 2024 · 2 comments

@acidus99

acidus99 commented Jan 8, 2024

I think this is something people know, but it is not explicitly stated: Can a record have multiple extension-fields of the same type?

Section 5.1 of the 1.1 spec says "WARC named fields of the same type shall not be repeated in the same WARC record (for example, a WARC record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g. WARC-Concurrent-To)." However it makes no explicit mention of whether multiple extension-fields of the same type are allowed. It does say "WARC processing software shall ignore fields with unrecognized names" which could mean it is allowed.

I think the answer is yes. But this is not stated anywhere. An example of multiple extension-fields of the same type on the same record that I've found so far is #42, the proposed WARC-Protocol field. That shows examples using 2 fields (for TLS and HTTP), but presumably at some point this will become a named field and have language in the spec like WARC-Concurrent-To does, leaving this question unanswered.

A reason to explicitly discuss multiple extension-fields of the same type is to avoid implementation issues. I suspect most WARC parsing software implements field parsing for extension-fields with a dictionary/hash, keyed on the field name, where duplicate keys are not allowed. Implementations will then behave differently when a field is repeated (first value wins, last value wins, etc.). I personally hit this when parsing records with multiple WARC-Protocol fields.
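To make the parsing concern concrete, here is a minimal Python sketch of a field parser that keeps every value for a repeated field instead of letting one overwrite another. It is not taken from any existing WARC library: the function name `parse_fields` and the sample WARC-Protocol values are made up for illustration, field names are lowercased on the assumption that they compare case-insensitively, and folded continuation lines are not handled.

```python
from collections import defaultdict


def parse_fields(header_block: str) -> dict[str, list[str]]:
    """Collect all values for each field name, in order of appearance.

    Hypothetical helper for illustration only; it does not handle
    folded (continuation) lines.
    """
    fields: dict[str, list[str]] = defaultdict(list)
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            # No colon: blank line or the version line (e.g. "WARC/1.1").
            continue
        # Assumption: field names are compared case-insensitively.
        fields[name.strip().lower()].append(value.strip())
    return dict(fields)


# Illustrative record header; the WARC-Protocol values are placeholders.
sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Protocol: http/1.1\r\n"
    "WARC-Protocol: tls/1.3\r\n"
)
fields = parse_fields(sample)
```

With a plain `dict[str, str]` the second WARC-Protocol line would silently replace the first (or be dropped, depending on the implementation), which is the divergence described above. A list per name keeps both values and leaves the policy decision to the caller.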

Perhaps it should be explicitly stated, or does the "ignore fields with unrecognized names" clause cover this?

@JustAnotherArchivist

Agreed, repeated extension-fields should definitely be allowed. You might argue that 'as noted' also applies to extensions. Of course, a parser that doesn't support a particular extension wouldn't know whether a field defined there allows repetitions, and so I do think the 'ignore unrecognised fields' clause sort of covers it. But it'd still be good to fix this in the core specification in my opinion.

The easiest resolution would naturally be to change the quoted paragraph of section 5.1 to talk about defined-fields rather than 'named fields'. But perhaps it's worth considering a renaming of the entire section 5 instead.

ato added the warc-format label Jan 9, 2024
@ato
Member

ato commented Jan 9, 2024

Yes. My interpretation and implementation is that "shall not be repeated ... except as noted" sets up a default, so that each named field doesn't need its own statement disallowing repetition. The specification of an individual field can override this default, regardless of whether the field is defined by the core format or by an extension.

Given there's only one repeatable core field, I think the fact it's worded broadly as "except as noted (e.g. WARC-Concurrent-To)" instead of "except WARC-Concurrent-To" supports this interpretation.
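As a rough illustration of this "non-repeatable by default, overridable per field" reading, the check could be sketched in Python as below. The names `CORE_REPEATABLE` and `disallowed_repeats` are hypothetical, and the sketch assumes the caller supplies the set of extension fields whose own specifications permit repetition. How a parser should treat repeated fields it does not recognize at all is the open question in this thread, and the sketch does not settle it.

```python
from collections import Counter

# The one repeatable core field mentioned in this thread.
CORE_REPEATABLE = frozenset({"warc-concurrent-to"})


def disallowed_repeats(names, extension_repeatable=()):
    """Return the lowercased field names that repeat without being allowed to.

    `extension_repeatable` is an assumption of this sketch: the names of
    extension fields whose specifications explicitly allow repetition.
    """
    allowed = CORE_REPEATABLE | {n.lower() for n in extension_repeatable}
    counts = Counter(n.lower() for n in names)
    return sorted(n for n, c in counts.items() if c > 1 and n not in allowed)


names = [
    "WARC-Type",
    "WARC-Date", "WARC-Date",
    "WARC-Concurrent-To", "WARC-Concurrent-To",
    "WARC-Protocol", "WARC-Protocol",
]
strict = disallowed_repeats(names)
with_extension = disallowed_repeats(names, extension_repeatable={"WARC-Protocol"})
```

Under the default, the repeated WARC-Protocol is flagged alongside the repeated WARC-Date. Once an extension declares WARC-Protocol repeatable, only WARC-Date remains flagged.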
