WARC-Protocol field proposal #42

ato · 2018-07-13T03:46:13Z

Motivation:

- To allow the recording of messages using a different representation to their wire message format, as
  - the wire protocol may be suboptimal for the purposes of storage and replay; or
  - the raw bytes of the wire protocol may be unavailable.

  For example, it was proposed in "WARC revision 1.1 (modification): support of HTTP 2.X protocol in WARC format" (#15) and "WARC Extensions for HTTP/2 proposal" (#41) to allow HTTP/2 messages to be represented as application/http.
- To allow the presence of layered protocols like TLS to be recorded.
  - The URI scheme is an unreliable indicator of the presence of TLS due to opportunistic TLS (which is also now defined for HTTP/2).
- To allow readers of WARC files to determine the protocol of a message without having to know how to parse the record block.
- To disambiguate when the protocol cannot be determined from the message itself. Many protocols, including HTTP/2 and SPDY, negotiate the protocol version up front, and subsequent messages are not tagged with a protocol identifier.

WARC-Protocol field definition

The WARC-Protocol field denotes the protocol of the original network message
this record holds information about.

WARC-Protocol = "WARC-Protocol" ":" protocol-id
protocol-id = "dns"      ; DNS [RFC 1035]
            | "ftp"      ; FTP [RFC 959]
            | "gemini"   ; Gemini
            | "gopher"   ; Gopher [RFC 1436]
            | "http/0.9" ; HTTP/0.9
            | "http/1.0" ; HTTP/1.0 [RFC 1945]
            | "http/1.1" ; HTTP/1.1 [RFC 7230]
            | "h2"       ; HTTP/2 over TLS [RFC 7540]
            | "h2c"      ; HTTP/2 over cleartext TCP [RFC 7540]
            | "spdy/1"   ; SPDY/1
            | "spdy/2"   ; SPDY/2
            | "spdy/3"   ; SPDY/3
            | "ssl/2"    ; SSLv2 aka SSL 0.2
            | "ssl/3"    ; SSLv3 aka SSL 3.0 [RFC 6101]
            | "tls/1.0"  ; TLS 1.0 [RFC 2246]
            | "tls/1.1"  ; TLS 1.1 [RFC 4346]
            | "tls/1.2"  ; TLS 1.2 [RFC 5246]
            | "tls/1.3"  ; TLS 1.3

If the protocol you wish to record is not on the list above please file an issue to
propose a protocol identifier before using it.

The WARC-Protocol field may be omitted when the protocol is unknown or can be
unambiguously determined from some combination of the scheme portion of the
WARC-Target-URI field, the Content-Type field and the message in the record
block itself.

Multiple WARC-Protocol fields may be present to indicate protocol layering. For
example HTTP/1.1 over TLS 1.0 would be indicated by:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0
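
A reader can recover the layering by collecting every WARC-Protocol field in a record header rather than keeping only one. The following is a minimal sketch, not part of the proposal: `protocol_stack` and the sample header lines are hypothetical, and it assumes the header has already been split into unfolded lines. The proposal does not say whether field order is significant, so the sketch simply preserves file order.

```python
def protocol_stack(header_lines):
    """Return every WARC-Protocol value in a record header, in file order."""
    stack = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        # WARC field names are case-insensitive, so compare in lower case.
        if sep and name.strip().lower() == "warc-protocol":
            stack.append(value.strip())
    return stack


# Hypothetical record header for illustration only.
headers = [
    "WARC-Type: response",
    "WARC-Target-URI: https://example.org/",
    "WARC-Protocol: http/1.1",
    "WARC-Protocol: tls/1.0",
]
print(protocol_stack(headers))  # ['http/1.1', 'tls/1.0']
```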

The WARC-Protocol field does not indicate the format of the record block and
is not a replacement for the Content-Type field. Different protocols may
reuse the same media type. There are also situations where it may be
desirable to represent the same message of a particular protocol using
different types such as semantically equivalent text and binary forms.

The WARC-Protocol field may be used in 'request', 'response',
'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo',
'conversion' and 'continuation' records.
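
As an illustration of the two constraints in this definition (the identifier list and the record types), a writer could check a record before emitting it. This is only a sketch: `check_warc_protocol` is a hypothetical helper, and the identifier set is copied from the grammar above, so it would need updating whenever a new identifier is accepted.

```python
# Identifiers copied from the protocol-id grammar in this proposal.
PROTOCOL_IDS = frozenset({
    "dns", "ftp", "gemini", "gopher",
    "http/0.9", "http/1.0", "http/1.1", "h2", "h2c",
    "spdy/1", "spdy/2", "spdy/3",
    "ssl/2", "ssl/3", "tls/1.0", "tls/1.1", "tls/1.2", "tls/1.3",
})

# Record types on which the proposal says the field shall not be used.
FORBIDDEN_RECORD_TYPES = frozenset({"warcinfo", "conversion", "continuation"})


def check_warc_protocol(warc_type, protocol_ids):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if protocol_ids and warc_type in FORBIDDEN_RECORD_TYPES:
        problems.append(f"WARC-Protocol is not allowed on '{warc_type}' records")
    for protocol_id in protocol_ids:
        if protocol_id not in PROTOCOL_IDS:
            problems.append(f"unknown protocol-id '{protocol_id}'")
    return problems


print(check_warc_protocol("response", ["http/1.1", "tls/1.2"]))  # []
print(check_warc_protocol("warcinfo", ["h2"]))
```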

Determining the protocol in the absence of WARC-Protocol

URI Scheme	Content-Type	Header version	Protocol
dns	text/dns		dns ; transport unknown
ftp			ftp ; over cleartext TCP
gemini	application/gemini †		gemini ; over TLS #85
gopher	application/gopher †		gopher ; over cleartext TCP
http	application/http	absent	http/0.9 ; over cleartext TCP
http	application/http	"HTTP/1.0"	http/1.0 ; over cleartext TCP
http	application/http	"HTTP/1.1"	http/1.1 ; over cleartext TCP
https	application/http	"HTTP/1.0"	http/1.0 ; over TLS
https	application/http	"HTTP/1.1"	http/1.1 ; over TLS

† Not a registered media type but has been used in the wild.

When the WARC-Protocol field is present it takes precedence over the rules in the table above.
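
The fallback rules and the precedence rule can be sketched as a small lookup. This is an illustration under stated assumptions, not a normative algorithm: `infer_protocol` is a hypothetical name, blank table cells are read as "not consulted", Content-Type parameters such as `msgtype` are ignored, and only the application-level protocol is returned (the transport notes in the table are omitted).

```python
# Rows of the table that are decided by scheme and media type alone.
BY_SCHEME_AND_TYPE = {
    ("dns", "text/dns"): "dns",
    ("gemini", "application/gemini"): "gemini",
    ("gopher", "application/gopher"): "gopher",
}

# HTTP rows, keyed by the version in the message header (None = absent).
HTTP_BY_HEADER_VERSION = {
    None: "http/0.9",
    "HTTP/1.0": "http/1.0",
    "HTTP/1.1": "http/1.1",
}


def infer_protocol(scheme, content_type=None, header_version=None, declared=None):
    """Return a protocol-id, or None when the table has no matching row."""
    if declared:
        # An explicit WARC-Protocol field takes precedence over the table.
        return declared
    scheme = scheme.lower()
    media_type = content_type.split(";")[0].strip().lower() if content_type else None
    if scheme == "ftp":
        return "ftp"
    if (scheme, media_type) in BY_SCHEME_AND_TYPE:
        return BY_SCHEME_AND_TYPE[(scheme, media_type)]
    if media_type == "application/http":
        if scheme == "http":
            return HTTP_BY_HEADER_VERSION.get(header_version)
        if scheme == "https" and header_version is not None:
            # The table has no https row for an absent header version.
            return HTTP_BY_HEADER_VERSION.get(header_version)
    return None


print(infer_protocol("http", "application/http;msgtype=response", "HTTP/1.1"))
print(infer_protocol("https", "application/http", "HTTP/1.1", declared="h2"))
```

Returning None for anything outside the table keeps the sketch from guessing; a real reader might instead report the record as ambiguous.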

Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.


nlevitt · 2018-07-16T18:05:51Z

What should we say about other protocols not in your list? Seems to me it is desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could say, please file a github issue here to propose a new protocol id, before you use it. Then at least there is one place to check for prior art.

ato · 2018-07-17T01:13:38Z

> Maybe we could say, please file a github issue here to propose a new protocol id, before you use it.

I think that's a great idea. I've updated the proposal text to include a link to an issue template.

ato · 2018-07-17T02:01:57Z

h2c and h2 are the obvious odd ones out in the list, as they don't follow the general name/version form, and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field.

Also, I just made up the TLS protocol identifiers as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc. seem somewhat common in software though (Java, OpenSSL), so I can see an argument that they might be a better choice. I don't think there's a right answer here. The slash form is better in the sense that you can consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever name your TLS library reports. I couldn't see one argument as particularly more compelling than the other, so I just picked one.
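
To give a sense of how small the conversion cost is, here is a rough sketch that maps "TLSvX"-style names to the slash form. `to_protocol_id` is a hypothetical helper; the input strings are assumed to look like the ones Python's `ssl.SSLSocket.version()` reports (for example "TLSv1" or "TLSv1.2"), and other libraries may spell them differently.

```python
import re

# Matches names such as "SSLv3", "TLSv1" and "TLSv1.2".
_LIBRARY_NAME = re.compile(r"^(SSL|TLS)v(\d+(?:\.\d+)?)$")


def to_protocol_id(library_name):
    """Convert a 'TLSvX'-style name to the slash form used in this proposal."""
    match = _LIBRARY_NAME.match(library_name)
    if match is None:
        raise ValueError(f"unrecognised protocol name: {library_name!r}")
    family, version = match.groups()
    family = family.lower()
    if family == "tls" and "." not in version:
        version += ".0"  # "TLSv1" means TLS 1.0, written "tls/1.0" here
    return f"{family}/{version}"


print(to_protocol_id("TLSv1.2"))  # tls/1.2
print(to_protocol_id("TLSv1"))    # tls/1.0
print(to_protocol_id("SSLv3"))    # ssl/3

# The slash form lets the version be chopped off uniformly.
name, _, version = to_protocol_id("TLSv1.1").partition("/")
print(name, version)  # tls 1.1
```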

ato · 2019-03-06T00:48:15Z

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, as it is also a list of version numbers, should be consistent with it too. I keep going back and forth on it as I think there are arguments either way.

In favour of a single field in the style of User-Agent:

- It makes WARC fields easier to deal with in most programming languages as you can just dump them into a hash table (with the exception of WARC-Concurrent-To).
- I like the idea of using a consistent mini-language across all three headers (User-Agent, WARC-Software-Version, WARC-Protocol) for specifying component version numbers. It also leads to the obvious extension of allowing comments with more details for diagnostic/troubleshooting purposes.
- It's more concise, which makes records more human-readable.

In favour of repeated fields:

- It doesn't require field-specific parsing.
- WARC does allow specific fields to be repeated so that's something readers have to account for anyway.
- It's simpler to write a matching expression for generic filtering tools.
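
The hash table trade-off can be seen in a few lines. This is a sketch only: `fields_as_dict` is a hypothetical helper, and the space-separated single-field syntax is an assumption made for illustration, since no such syntax has been defined for WARC-Protocol.

```python
def fields_as_dict(header_lines):
    """Naively load header fields into a dict; a later field overwrites an earlier one."""
    fields = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return fields


repeated = ["WARC-Protocol: http/1.1", "WARC-Protocol: tls/1.0"]
single = ["WARC-Protocol: http/1.1 tls/1.0"]  # hypothetical User-Agent-like form

# Repeated fields: the naive dict keeps only the last value.
print(fields_as_dict(repeated)["warc-protocol"])        # tls/1.0
# Single field: nothing is lost, but the value needs field-specific splitting.
print(fields_as_dict(single)["warc-protocol"].split())  # ['http/1.1', 'tls/1.0']
```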

acidus99 · 2023-05-30T21:31:54Z

I have a question about which record types the WARC-Protocol header, as well as the WARC-TLS-Cipher-Suite header mentioned/proposed by @ato here, should appear on.

Both a request and a response can travel on top of a TLS connection, so presumably these headers could appear on both the request and response records. But should they?
A client cannot change the TLS version or cipher suite between a request and a response, so the header values would be identical for request/response record pairs. Including them on both seems like needless duplication, especially if the records are linked with a WARC-Concurrent-To.

The most similar, already defined header I could think of to this is WARC-IP-Address. Section 5.10 of the 1.1 spec says "the numeric Internet address contacted to retrieve any included content" and can be associated with request and response records. But all the examples in the spec only show the WARC-IP-Address header on response records, and I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

(Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the WARC-IP-Address header on response instead of the request.)

It feels like the WARC-Protocol and WARC-TLS-Cipher-Suite headers should go where the WARC-IP-Address header goes, but I really am curious about the community's feedback.

JustAnotherArchivist · 2023-05-30T23:59:18Z

> I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using warcio.capture_http). I'm sure there are more. Heritrix and warcprox don't. If you want some real-world example WARCs, the ArchiveTeam collection on the Internet Archive is full of them.

I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and note the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol.

ato · 2023-05-31T01:51:03Z

I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’).

Some reasons for allowing it on multiple record types:

- In some cases the request and response may use different protocol versions (e.g. http/1.0 vs http/1.1).
- You may have information about the protocol that was used but not have the actual request or response message. This can occur for example when converting to WARC from another format or due to tool limitations (e.g. in-browser archiving).

> it's odd that the convention is to include the WARC-IP-Address header on response instead of the request

It's likely because:

- The older ARC file format did not store the request but did store the IP address.
- Before the advent of browser-based crawling, request records were usually completely ignored and not indexed for replay. So if you're going to put it in just one record then choosing the response record would make it more easily accessible to replay tools.

acidus99 · 2023-05-31T13:13:48Z

Excellent, thanks for the context. I ended up including them on both request and response records.

ato added the proposal label Jul 13, 2018

ato mentioned this issue Jul 13, 2018

WARC Extensions for HTTP/2 proposal #41

Closed

ato added the http/2 label Jul 13, 2018

ato added the warc-format label Nov 18, 2018

ato mentioned this issue Jan 2, 2020

Record SSL certificates? internetarchive/warcprox#13

Open

This was referenced Oct 4, 2020

Do not use "http/2" protocol version in HTTP headers in WARC files commoncrawl/news-crawl#42

Open

[WARC] Backward compatible storage of HTTP/2 headers apache/incubator-stormcrawler#828

Closed

sebastian-nagel mentioned this issue Nov 29, 2022

[WARC] Backward compatible storage of HTTP/2 headers apache/incubator-stormcrawler#1010

Merged

This was referenced Jun 2, 2023

Gemini as a protocol for the WARC-Protocol field #85

Closed

WARC-Protocol addition proposal: Gopher #87

Closed

Arkiver2 mentioned this issue Dec 15, 2023

WARC-Cipher-Suite field proposal #86

Open

acidus99 mentioned this issue Jan 8, 2024

Multiple extension-fields of the same type on the same record? #95

Open
