Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam: Clarify the usage of UTF-8 characters in header #719

Open
zaeleus opened this issue Apr 28, 2023 · 3 comments · May be fixed by #726
Open

sam: Clarify the usage of UTF-8 characters in header #719

zaeleus opened this issue Apr 28, 2023 · 3 comments · May be fixed by #726
Labels

Comments

@zaeleus
Copy link

zaeleus commented Apr 28, 2023

This is in regard to Sequence Alignment/Map Format Specification (2022-08-22).

§ 1.3 "The header section" defines patterns for header lines:

Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/.

This invalidates the following test examples:

The text "UTF-8 encoding may be used" for the CL and DS fields does not remove the character set constraint. It also remains arbitrary as to why only some fields have this definition.

@jmarshall jmarshall added the sam label Apr 29, 2023
@jmarshall
Copy link
Member

The problem here is simply that the regular expression has not been updated after the addition of the loosening to allow UTF-8 in certain field values. These test files are not made invalid, and the ”UTF-8 encoding may be used [in these field values]” text should be genially interpreted as augmenting and superseding the ASCII restriction in that part of the regular expression.

That said, there is indeed a minor inconsistency here and we should improve the [ -~]+ part of the regular expression so that it allows for UTF-8 characters. And perhaps add some text nearby stating that UTF-8 fields are the exception and most fields are limited to [ -~]+.

What it is really trying to say is that field values are strings of one or more non-TAB characters. In expanding this to allow UTF-8 characters, we should consider whether there are any other UTF-8 whitespace characters that we might want to disallow.

It is indeed true that only a few particularly and individually specified fields allow UTF-8 characters beyond ASCII. IMHO the spec doesn't particularly need to explain why the particular choices have been made. OTOH it is useful for the lore to be known amongst the spec's maintainers and contributors.

The general rule of thumb is that anything that would be processed or interpreted by a program is restricted to ASCII. Free text fields that are ignored by programs or are simply displayed to the user (but not interpreted by program code) are candidates for specifying as allowing UTF-8.

The justification for this is that doing things like testing whether two strings represent the same text is non-trivial in UTF-8, as noted by @daviesrob in ga4gh/seqcol-spec#2 (comment) and probably in other similar discussions too. Thus for example allowing UTF-8 in RNAME fields would make it difficult to determine whether two SAM records referred to the same chromosome while not adding any particular expressive power.

@jkbonfield
Copy link
Contributor

jkbonfield commented May 2, 2023

We clearly need a footnote somewhere to explain the regexp may need some nuanced interpretation for UTF-8 capable fields. Attempting to include UTF-8 in the regexp will just break things horribly and add more confusion than it solves. I think maybe one way of handling this is to use POSIX character classes as they document the meaning of the terms. [ -~] is basically [[:print:]] as far as I can see. If so, we can then add a footnote stating:

In ASCII-only fields [[:print:]] is equivalent to [ -~], but note a few fields permit UTF-8 and the regular expression should not be taken as resricting that data.

Edit: actually if locale is set correctly, it may be that e.g. POSIX [[:alpha:]] really does mean alphabetical even for UTF-8 symbols, depending on the regexp engine you're using. Basically it's one way of dodging the question. It still does warrant a footnote though to be explicit about ASCII vs UTF-8. Basically the regular expression doesn't cut it for the entire header as it's too broad given different fields have different locales, but it gets the concept over well.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 4, 2023
Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes samtools#719.
@jmarshall
Copy link
Member

jmarshall commented May 4, 2023

There are lots of Unicode whitespace characters, but fortunately none of them have tab-like properties. There are however additional line separator characters, so I have added text to PR #670 (which is no longer draft) to emphasise that SAM file line endings must be ASCII — in particular LF or CR-LF.

we should consider whether there are any other UTF-8 whitespace characters that we might want to disallow.

Upon further reflection, I don't think we should do this. Disallowing them would invite people to think that such characters would act as field or line separators, but we want parsing those delimiters to remain doable by looking only for ASCII. This means that people could put U+2028 (line separator) characters in their @SQ-DS fields and have the @SQ header appear to span multiple lines. We can call this a feature 😄

I think maybe one way of handling this is to use POSIX character classes as they document the meaning of the terms. [ -~] is basically [[:print:]] as far as I can see.

That's a pretty good idea. (And [!-~] is [[:graph:]].) I have made PR #726 to try it out.

What the existing [ -~] regex really does is forbid control characters in these field values: no \r or \n, no tab, and no NUL characters. The rest of the control characters we don't really care about, but it is as well to forbid them all. This character class trick is a good way to extend that to all of Unicode.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 4, 2023
Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes samtools#719.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 12, 2023
Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes samtools#719.
jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 16, 2023
Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes samtools#719.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs review
Development

Successfully merging a pull request may close this issue.

3 participants