Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Insufficient definition of quoted-string #71

Open
JustAnotherArchivist opened this issue Jul 4, 2021 · 5 comments
Open

Insufficient definition of quoted-string #71

JustAnotherArchivist opened this issue Jul 4, 2021 · 5 comments

Comments

@JustAnotherArchivist
Copy link

First, these are the relevant definitions from WARC/1.1:

quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext        = <any TEXT except <">>
quoted-pair   = "\" CHAR ; single-character quoting
TEXT          = <any OCTET except CTLs,
                but including LWS>
CHAR          = <UTF-8 characters; RFC3629>  ; (0-191, 194-244)

The issue here is that the backslash is not excluded in qdtext. This means that any quoted-string value involving backslashes is ambiguous:

  • "\\" could mean either two backslashes (via qdtext) or one (via quoted-pair).
  • "\a" could mean \a or just a.
  • "\" could be a single backslash or the beginning of a quoted-string with the first character being a double quote; this could lead to all sorts of parser errors later on.

I propose to redefine qdtext as:

qdtext        = <any TEXT except <"> and "\">
@JustAnotherArchivist
Copy link
Author

On a related note, I think the specification should mention that the value of a quoted-string is its decoded contents, i.e. the concatenation of the qdtext and the quoted-pairs with the escaping backslashes removed. (At least I assume that's the intended meaning.) At the moment, it is simply used in the definitions of field-content and WARC-Filename with no further note of the meaning.

@ato
Copy link
Member

ato commented Jul 4, 2021

It appears those definitions were copied verbatim from RFC 2616 except that WARC defines CHAR as a UTF-8 character instead of an ASCII octet.

The newer RFC 7230 indeed excludes \ (0x5C) and " (0x22) from qdtext and also has a little more explanation of how a quoted-string should be interpreted:

A string of text is parsed as a single value if it is quoted using
double-quote marks.

 quoted-string  = DQUOTE *( qdtext / quoted-pair ) DQUOTE
 qdtext         = HTAB / SP /%x21 / %x23-5B / %x5D-7E / obs-text
 obs-text       = %x80-FF

The backslash octet ("\") can be used as a single-octet quoting
mechanism within quoted-string and comment constructs. Recipients
that process the value of a quoted-string MUST handle a quoted-pair
as if it were replaced by the octet following the backslash.

 quoted-pair    = "\" ( HTAB / SP / VCHAR / obs-text )

A sender SHOULD NOT generate a quoted-pair in a quoted-string except
where necessary to quote DQUOTE and backslash octets occurring within
that string. A sender SHOULD NOT generate a quoted-pair in a comment
except where necessary to quote parentheses ["(" and ")"] and
backslash octets occurring within that comment.

It also defines other important details that WARC does not such as that leading and trailing whitespace in field values is removed during parsing and that one of the reasons to use quoted-string is when it's necessary to preserve it. Presumably the reason it's used in WARC-Filename is so that filenames that start or end with whitespace or that contain CTLs can be preserved.

@JustAnotherArchivist
Copy link
Author

By the way, I realise that these definitions were taken directly from RFC 2616. The RFC has a bit of clarification but also doesn't avoid this ambiguity. Specifically, it states that 'The backslash character ("") MAY be used as a single-character quoting mechanism only within quoted-string and comment constructs.', but that only specifies where the backslash may be used as an escape character and does not reject its use as a literal backslash in qdtext. So the same ambiguity remains there.

RFC 7230 redefined qdtext = HTAB / SP /%x21 / %x23-5B / %x5D-7E / obs-text (where obs-text = %x80-FF), i.e. removed the backslash (and also the continuation lines).

@JustAnotherArchivist
Copy link
Author

Aah, crossfire.

Yeah, the 7230 definition makes much more sense. It probably can't simply be taken over to WARC though due to the continuation lines, which were deprecated in 7230, and indeed the quoted-string redefinition would not allow them. Whether that's a concern is another matter of course; personally, I've always hated and never understood the need for line folding in the first place.

@JustAnotherArchivist
Copy link
Author

JustAnotherArchivist commented Jul 9, 2021

Actually, the line folding does have an undesired effect: if you want to represent any white space other than a single space in a quoted-string, you must escape it [0]. That's because qdtext is defined via TEXT, which includes LWS, which collapses any positive number of spaces or tabs into a single space semantically. 7230's definition does not have this problem because it doesn't make use of LWS, so white space inside a qdtext is taken literally.

However, there is also an issue with 7230's definition: control characters cannot be included in a quoted-string. But a WARC-Filename value must be able to contain any byte sequence because that's how file systems work at least in the Unix-like world: the only prohibited byte in a path is NUL...

[0] Edit: Minor correction: you can also represent N spaces with N line folds. For example, a field X-Header with the value of foo bar (that's two spaces between the words, which GitHub doesn't display correctly) could be represented as (in ABNF notation): "X-Header: " <"> "foo" CRLF SP CRLF SP "bar" <">.
There is also a solution using trailing white space, which is collapsed into a single space via LWS as well. For example, the value above could be encoded as <"> "foo " CRLF SP "bar" <">. There could be any positive number of SP or HT after foo without changing the field value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants