WARC-Protocol field proposal #42

Open
ato opened this issue Jul 13, 2018 · 9 comments

Comments

@ato
Member

ato commented Jul 13, 2018

Motivation:

  • To allow the recording of messages using a different representation to their wire message format as
  • To allow the presence of layered protocols like TLS to be recorded.
  • To allow readers of WARC files to be able to determine the protocol of a message without having to know how to parse the record block.
  • To disambiguate when the protocol cannot be determined from the message itself. Many protocols, including HTTP/2 and SPDY, negotiate protocol version up front and subsequent messages are not tagged with a protocol identifier.

WARC-Protocol field definition

The WARC-Protocol field denotes the protocol of the original network message
this record holds information about.

WARC-Protocol = "WARC-Protocol" ":" protocol-id
protocol-id = "dns"      ; DNS [RFC 1035]
            | "ftp"      ; FTP [RFC 959]
            | "gemini"   ; Gemini
            | "gopher"   ; Gopher [RFC 1436]
            | "http/0.9" ; HTTP/0.9
            | "http/1.0" ; HTTP/1.0 [RFC 1945]
            | "http/1.1" ; HTTP/1.1 [RFC 7230]
            | "h2"       ; HTTP/2 over TLS [RFC 7540]
            | "h2c"      ; HTTP/2 over cleartext TCP [RFC 7540]
            | "spdy/1"   ; SPDY/1
            | "spdy/2"   ; SPDY/2
            | "spdy/3"   ; SPDY/3
            | "ssl/2"    ; SSLv2 aka SSL 0.2
            | "ssl/3"    ; SSLv3 aka SSL 3.0 [RFC 6101]
            | "tls/1.0"  ; TLS 1.0 [RFC 2246]
            | "tls/1.1"  ; TLS 1.1 [RFC 4346]
            | "tls/1.2"  ; TLS 1.2 [RFC 5246]
            | "tls/1.3"  ; TLS 1.3

If the protocol you wish to record is not on the list above please file an issue to
propose a protocol identifier before using it.
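
A reader or writer could check values against the list above before using them. The following is only a sketch: the names KNOWN_PROTOCOL_IDS, is_registered and split_version are made up for illustration, and exact (case-sensitive) matching is an assumption the proposal does not state.

```python
# Protocol identifiers transcribed from the protocol-id list above.
KNOWN_PROTOCOL_IDS = frozenset({
    "dns", "ftp", "gemini", "gopher",
    "http/0.9", "http/1.0", "http/1.1", "h2", "h2c",
    "spdy/1", "spdy/2", "spdy/3",
    "ssl/2", "ssl/3",
    "tls/1.0", "tls/1.1", "tls/1.2", "tls/1.3",
})


def is_registered(protocol_id):
    """True if the value is one of the listed identifiers (exact match assumed)."""
    return protocol_id in KNOWN_PROTOCOL_IDS


def split_version(protocol_id):
    """Split 'name/version' into (name, version); version is None when there is no slash."""
    name, sep, version = protocol_id.partition("/")
    return (name, version if sep else None)


print(is_registered("tls/1.2"))   # True
print(is_registered("TLSv1.2"))   # False
print(split_version("tls/1.2"))   # ('tls', '1.2')
print(split_version("h2"))        # ('h2', None)
```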

The WARC-Protocol field may be omitted when the protocol is unknown or can be
unambiguously determined from some combination of the scheme portion of the
WARC-Target-URI field, the Content-Type field and the message in the record
block itself.

Multiple WARC-Protocol fields may be present to indicate protocol layering. For
example HTTP/1.1 over TLS 1.0 would be indicated by:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0
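
As a rough sketch of how a reader might collect the layers from repeated fields: the function name protocol_layers and the list-of-lines input are assumptions for illustration, and the proposal does not say whether field order is significant, so this simply preserves the order in the record.

```python
def protocol_layers(header_lines):
    """Return every WARC-Protocol value, in the order the fields appear."""
    layers = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        # Field names are compared case-insensitively here.
        if sep and name.strip().lower() == "warc-protocol":
            layers.append(value.strip())
    return layers


headers = [
    "WARC-Type: response",
    "WARC-Protocol: http/1.1",
    "WARC-Protocol: tls/1.0",
]
print(protocol_layers(headers))  # ['http/1.1', 'tls/1.0']
```

Keeping the values in a list rather than a dictionary matters here, because a plain name-to-value mapping would keep only one of the two fields.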

The WARC-Protocol field does not indicate the format of the record block and
is not a replacement for the Content-Type field. Different protocols may
reuse the same media type. There are also situations where it may be
desirable to represent the same message of a particular protocol using
different types such as semantically equivalent text and binary forms.

The WARC-Protocol field may be used in 'request', 'response',
'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo',
'conversion' and 'continuation' records.
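
A writer could enforce this rule with a small lookup. This is a sketch only: may_carry_warc_protocol is a hypothetical helper, and raising an error for record types the proposal does not mention is an assumption.

```python
# Record types taken from the paragraph above.
ALLOWED_TYPES = frozenset({"request", "response", "resource", "metadata", "revisit"})
FORBIDDEN_TYPES = frozenset({"warcinfo", "conversion", "continuation"})


def may_carry_warc_protocol(warc_type):
    """Return True/False for listed record types; raise for unlisted ones."""
    if warc_type in ALLOWED_TYPES:
        return True
    if warc_type in FORBIDDEN_TYPES:
        return False
    # The proposal is silent on other record types (assumption: treat as an error).
    raise ValueError(f"record type not covered by the proposal: {warc_type!r}")


print(may_carry_warc_protocol("revisit"))   # True
print(may_carry_warc_protocol("warcinfo"))  # False
```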

Determining the protocol in the absence of WARC-Protocol

URI Scheme  Content-Type          Header version  Protocol
dns         text/dns                              dns      ; transport unknown
ftp                                               ftp      ; over cleartext TCP
gemini      application/gemini †                  gemini   ; over TLS #85
gopher      application/gopher †                  gopher   ; over cleartext TCP
http        application/http      absent          http/0.9 ; over cleartext TCP
http        application/http      "HTTP/1.0"      http/1.0 ; over cleartext TCP
http        application/http      "HTTP/1.1"      http/1.1 ; over cleartext TCP
https       application/http      "HTTP/1.0"      http/1.0 ; over TLS
https       application/http      "HTTP/1.1"      http/1.1 ; over TLS

† Not a registered media type but has been used in the wild.

When the WARC-Protocol field is present it takes precedence over the rules in the table above.
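
The fallback rules and the precedence rule could be sketched roughly as below. The names RULES, ANY and resolve_protocol are hypothetical, and two things are assumptions rather than part of the proposal: blank table cells are treated as "not consulted", and any Content-Type parameters (such as "; msgtype=response") are stripped before comparison. The sketch returns only the application-layer identifier from the Protocol column, not the transport noted after the semicolon.

```python
ANY = object()  # blank table cell, assumed to mean "not consulted"

# (URI scheme, media type, header version, protocol-id), transcribed from the table.
# A header version of None stands for "absent".
RULES = [
    ("dns", "text/dns", ANY, "dns"),
    ("ftp", ANY, ANY, "ftp"),
    ("gemini", "application/gemini", ANY, "gemini"),
    ("gopher", "application/gopher", ANY, "gopher"),
    ("http", "application/http", None, "http/0.9"),
    ("http", "application/http", "HTTP/1.0", "http/1.0"),
    ("http", "application/http", "HTTP/1.1", "http/1.1"),
    ("https", "application/http", "HTTP/1.0", "http/1.0"),
    ("https", "application/http", "HTTP/1.1", "http/1.1"),
]


def _matches(pattern, value):
    return pattern is ANY or pattern == value


def resolve_protocol(explicit, scheme, content_type=None, header_version=None):
    """Return a list of protocol ids; explicit WARC-Protocol values win over the table."""
    if explicit:
        return list(explicit)
    media_type = content_type.split(";")[0].strip().lower() if content_type else None
    for rule_scheme, rule_type, rule_version, protocol in RULES:
        if (rule_scheme == scheme.lower()
                and _matches(rule_type, media_type)
                and _matches(rule_version, header_version)):
            return [protocol]
    return []  # protocol cannot be determined from these inputs


print(resolve_protocol([], "https", "application/http; msgtype=response", "HTTP/1.1"))
# ['http/1.1']
print(resolve_protocol(["h2", "tls/1.3"], "https", "application/http", "HTTP/1.1"))
# ['h2', 'tls/1.3']
```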

Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.

@nlevitt
Member

nlevitt commented Jul 16, 2018

What should we say about other protocols not in your list? Seems to me it is desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could say, please file a github issue here to propose a new protocol id, before you use it. Then at least there is one place to check for prior art.

@ato
Member Author

ato commented Jul 17, 2018

Maybe we could say, please file a github issue here to propose a new protocol id, before you use it.

I think that's a great idea. I've updated the proposal text to include a link to an issue template.

@ato
Member Author

ato commented Jul 17, 2018

h2c and h2 are obvious odd ones out in the list as they don't follow the general name/version form and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field.

Also I just made up the TLS protocol identifiers as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc. seem somewhat common in software though (Java, OpenSSL), so I can see an argument that they might be a better choice. I don't think there's a right answer here. The slash form is better in the sense that you can consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever name your TLS library reports. I couldn't see one argument as particularly more compelling than the other, so I just picked one.

@ato
Member Author

ato commented Mar 6, 2019

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, being also a list of version numbers, should be consistent with it. I keep going back and forth on it as I think there are arguments either way.

In favour of a single field in the style of User-Agent:

  • It makes WARC fields easier to deal with in most programming languages as you can just dump them into a hash table (with the exception of WARC-Concurrent-To).
  • I like the idea of using a consistent mini-language across all three headers (User-Agent, WARC-Software-Version, WARC-Protocol) for specifying component version numbers. It also leads to the obvious extension of allowing comments with more details for diagnostic/troubleshooting purposes.
  • It's more concise which makes records more human readable.

In favour of repeated fields:

  • It doesn't require field-specific parsing.
  • WARC does allow specific fields to be repeated so that's something readers have to account for anyway.
  • It's simpler to write a matching expression for generic filtering tools.
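
To make the trade-off concrete, here is a rough sketch of both styles. The single-field syntax shown (space-separated name/version tokens, no comments) is only a guess at what a User-Agent-like form could look like, since no grammar for it was proposed, and both function names are hypothetical.

```python
def parse_single_field(value):
    """Hypothetical User-Agent-style form, e.g. 'http/1.1 tls/1.0' (comments not handled)."""
    return value.split()


def parse_repeated_fields(fields):
    """fields is a list of (name, value) pairs, kept as a list so repeats survive."""
    return [value.strip() for name, value in fields if name.lower() == "warc-protocol"]


repeated = [("WARC-Protocol", "http/1.1"), ("WARC-Protocol", "tls/1.0")]

print(parse_single_field("http/1.1 tls/1.0"))  # ['http/1.1', 'tls/1.0']
print(parse_repeated_fields(repeated))         # ['http/1.1', 'tls/1.0']

# Dumping repeated fields into a hash table silently keeps only the last value.
print(dict(repeated)["WARC-Protocol"])         # tls/1.0
```

The last line shows the hash table concern: the single-field form survives being stored in a dictionary, while the repeated form needs a list or multi-map but no field-specific parsing.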

@acidus99

acidus99 commented May 30, 2023

I have a question about which record types the WARC-Protocol header, as well as the WARC-TLS-Cipher-Suite header mentioned/proposed by @ato here, should appear on.

  • Both a request and a response can travel on top of a TLS connection, so presumably these headers could appear on both the request and response records. But should they?
  • A client cannot change the TLS version or cipher suite between a request and a response, so the header values would be identical for request/response record pairs. Including it on both seems like needless duplication, especially if the records are linked with a WARC-Concurrent-To.

The most similar, already defined header I could think of to this is WARC-IP-Address. Section 5.10 of the 1.1 spec says "the numeric Internet address contacted to retrieve any included content" and can be associated with request and response records. But all the examples in the spec only show the WARC-IP-Address header on response records, and I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

(Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the WARC-IP-Address header on response instead of the request.)

It feels like the WARC-Protocol and WARC-TLS-Cipher-Suite headers should go where the WARC-IP-Address header goes, but I really am curious about the community's feedback.

@JustAnotherArchivist

I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using warcio.capture_http). I'm sure there are more. Heritrix and warcprox don't. If you want some real-world example WARCs, the ArchiveTeam collection on the Internet Archive is full of them.

I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and note the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol.

@ato
Member Author

ato commented May 31, 2023

I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’).

Some reasons for allowing it on multiple record types:

  • In some cases the request and response may use different protocol versions. (e.g. http/1.0 vs http/1.1)
  • You may have information about the protocol that was used but not have the actual request or response message. This can occur for example when converting to WARC from another format or due to tool limitations (e.g. in-browser archiving).

it's odd that the convention is to include the WARC-IP-Address header on response instead of the request

It's likely because:

  1. The older ARC file format did not store the request but did store the IP address.
  2. Before the advent of browser-based crawling, request records were usually completely ignored and not indexed for replay. So if you're going to put it in just one record then choosing the response record would make it more easily accessible to replay tools.

@acidus99

Excellent, thanks for the context. I ended up including them on both request and response records
