WARC-Protocol field proposal #42
Comments
What should we say about other protocols not in your list? It seems desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could ask people to file a GitHub issue here proposing a new protocol ID before they use it. Then at least there is one place to check for prior art.
I think that's a great idea. I've updated the proposal text to include a link to an issue template.
h2c and h2 are the obvious odd ones out in the list, as they don't follow the general name/version form, and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field. Also, I just made up the TLS protocol identifiers, as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc. seem somewhat common in software though (Java, OpenSSL), so I can see an argument that they might be a better choice. I don't think there's a right answer here. The slash form is better in the sense that you could consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever identifier your TLS library reports. I couldn't see one argument as particularly more compelling than the other, so I just picked one.
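To make the "chop the version off" point concrete, here is a minimal sketch. The helper `split_protocol_id` is hypothetical, and `tls/1.0` is only assumed to be a valid identifier in the slash form; `h2` and `TLSv1.1` show what happens to identifiers that do not follow that form.

```python
from typing import Optional, Tuple


def split_protocol_id(identifier: str) -> Tuple[str, Optional[str]]:
    """Split a name/version identifier at the first slash.

    Identifiers without a slash (such as "h2") yield a version of None.
    """
    name, sep, version = identifier.partition("/")
    return (name, version if sep else None)


print(split_protocol_id("tls/1.0"))   # slash form: name and version separate cleanly
print(split_protocol_id("h2"))        # no slash: the version stays embedded in the name
print(split_protocol_id("TLSv1.1"))   # "TLSvX" form: would need form-specific parsing
```

With the slash form a consumer can group records by protocol name without knowing each protocol's versioning scheme, which is the trade-off described above.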
After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, being also a list of version numbers, should be consistent with it. I keep going back and forth on it, as I think there are arguments either way. In favour of a single field in the style of User-Agent:
In favour of repeated fields:
I have a question on which record types the
The most similar, already defined header I could think of to this is (Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the It feels like the
Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and noting the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol.
I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’). Some reasons for allowing it on multiple record types:
It's likely because:
Excellent, thanks for the context. I ended up including them on both
Motivation:
For example, issues #15 ("WARC revision 1.1 (modification): support of HTTP 2.X protocol in WARC format") and #41 ("WARC Extensions for HTTP/2 proposal") proposed allowing HTTP/2 messages to be represented as application/http.
WARC-Protocol field definition
The WARC-Protocol field denotes the protocol of the original network message
this record holds information about.
If the protocol you wish to record is not on the list above please file an issue to
propose a protocol identifier before using it.
The WARC-Protocol field may be omitted when the protocol is unknown or can be
unambiguously determined from some combination of the scheme portion of the
WARC-Target-URI field, the Content-Type field and the message in the record
block itself.
Multiple WARC-Protocol fields may be present to indicate protocol layering. For
example HTTP/1.1 over TLS 1.0 would be indicated by:
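As a minimal sketch of how a WARC writer might emit such layered fields: the identifiers `http/1.1` and `tls/1.0` below assume the name/version slash form discussed in this thread, the application-first ordering is a guess, and `protocol_header_lines` is a hypothetical helper rather than part of any existing library.

```python
from typing import Iterable, List


def protocol_header_lines(protocols: Iterable[str]) -> List[str]:
    """Return one WARC-Protocol header line per protocol layer."""
    return [f"WARC-Protocol: {p}" for p in protocols]


# Assumed identifiers for HTTP/1.1 carried over TLS 1.0.
lines = protocol_header_lines(["http/1.1", "tls/1.0"])

# WARC header lines are terminated with CRLF.
print("\r\n".join(lines))
```

Repeating the field keeps each layer as a separate value, so a reader never has to split a combined string to find out whether TLS was involved.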
The WARC-Protocol field does not indicate the format of the record block and
is not a replacement for the Content-Type field. Different protocols may
reuse the same media type. There are also situations where it may be
desirable to represent the same message of a particular protocol using
different types such as semantically equivalent text and binary forms.
The WARC-Protocol field may be used in 'request', 'response',
'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo',
'conversion' and 'continuation' records.
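A validator could check this placement rule roughly as follows. This is a sketch under assumptions: `violates_placement` is a hypothetical helper, field names are compared case-insensitively, and record types not named in the rule are treated as out of scope rather than rejected.

```python
from typing import Iterable

# Record types on which the proposal says WARC-Protocol shall not be used.
FORBIDDEN_TYPES = frozenset({"warcinfo", "conversion", "continuation"})

# Record types on which the proposal says WARC-Protocol may be used.
ALLOWED_TYPES = frozenset({"request", "response", "resource", "metadata", "revisit"})


def violates_placement(warc_type: str, header_names: Iterable[str]) -> bool:
    """Return True if WARC-Protocol appears on a record type that forbids it."""
    has_field = any(name.lower() == "warc-protocol" for name in header_names)
    return has_field and warc_type in FORBIDDEN_TYPES


print(violates_placement("warcinfo", ["WARC-Type", "WARC-Protocol"]))
print(violates_placement("response", ["WARC-Type", "WARC-Protocol"]))
```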
Determining the protocol in the absence of WARC-Protocol
† Not a registered media type but has been used in the wild.
When the WARC-Protocol field is present it takes precedence over the rules in the table above.
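The precedence rule can be sketched as a simple fallback. The function `effective_protocols` and its `infer` callback are hypothetical; the callback stands in for whatever inference rules a reader applies when the field is absent, and those rules are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def effective_protocols(
    headers: Sequence[Tuple[str, str]],
    infer: Callable[[], List[str]],
) -> List[str]:
    """Prefer explicit WARC-Protocol values; fall back to inference only if none exist."""
    declared = [
        value.strip()
        for name, value in headers
        if name.lower() == "warc-protocol"
    ]
    return declared if declared else infer()


# The identifier "http/1.1" and the placeholder fallback are illustrative assumptions.
record_headers = [("WARC-Type", "response"), ("WARC-Protocol", "http/1.1")]
print(effective_protocols(record_headers, lambda: ["inferred-placeholder"]))
```

Passing the inference step as a callback means it only runs when no explicit field is present, which mirrors the rule that the field takes precedence.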
Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.