Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for UTF-8 field values in header regular expression #726

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jmarshall
Copy link
Member

Use [:print:] in the header regex and note that for ASCII it is equivalent to [ -~] and that the aim is to forbid control characters. Fixes #719.

To be honest, I'm tempted to add the extra [] to the \cclass definition and waste a bit of space each time this appears rather than add the “For brevity” sentence.

An alternative to this PR might be to just leave the regex as [ -~] and add a footnote explaining that this is an oversimplification for fields that allow Unicode values.

@jmarshall jmarshall added the sam label May 12, 2023
@github-actions
Copy link

Changed PDFs as of 229e998: SAMv1 (diff).

Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes samtools#719.
@github-actions
Copy link

Changed PDFs as of 3b4e874: SAMv1 (diff).

@jkbonfield
Copy link
Contributor

I'm unsure on the extra brackets also. My inclination though is it's probably not worth the hassle of inventing our own syntax and just going with the official double bracket style.

I'm guessing the extra brackets however were to permit things like [:[:alnum:]] being [:A-Za-z0-9] without ambiguity.

@zaeleus
Copy link

zaeleus commented Jun 1, 2023

Note that UTS #18: Annex C suggests :print: character class compatibility for Unicode as \p{graph}\p{blank}--\p{cntrl}, which is likely not the appropriate definition here since it includes :blank:. It may be better to note a property matcher instead, e.g., \P{Other}.

@jmarshall
Copy link
Member Author

Which specific :blank: characters are you worried about inadvertently including? (The obvious worry is TAB, but presumably that is removed by \p{cntrl}.)

@zaeleus
Copy link

zaeleus commented Jun 1, 2023

Ah, yes, you're right. I had a set operation wrong when I tested with an example. :print: seems sufficient for this change.

@jkbonfield
Copy link
Contributor

I see I was assigned this in the last meeting.

Personally my preference is [[:print:]] over [:print:] as it's a standard and ironically the extra couple of characters we add a few times is less text than the "for brevity" statement. Not a hard "must be so" line, but if in agreement I'd prefer that before merging. Otherwise I'm happy with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs review
Development

Successfully merging this pull request may close these issues.

sam: Clarify the usage of UTF-8 characters in header
3 participants