Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify SAM file encoding (mostly ASCII, some UTF-8 parts) #670

Merged
merged 1 commit into from
May 15, 2023

Conversation

jmarshall
Copy link
Member

Draft text clarifying the paragraph around file encoding, locales, and regular expressions.
Addresses #664.

I would like to spell it out in the first sentence a bit more but haven't come up with a great wording yet; something like

SAM files are encoded in UTF-8 using mainly ASCII characters only.

@hts-specs-bot
Copy link

hts-specs-bot commented Sep 13, 2022

Changed PDFs as of 3d3da3a: SAMv1 (diff).

@jkbonfield jkbonfield added the sam label Oct 10, 2022
@hts-specs-bot
Copy link

Changed PDFs as of 81b8058: SAMv1 (diff). This link will expire in 30 days

@jmarshall jmarshall force-pushed the encoding branch 3 times, most recently from 8987c26 to 0dc7e3d Compare October 11, 2022 12:15
@github-actions
Copy link

Changed PDFs as of 0dc7e3d: SAMv1 (diff).

@jkbonfield
Copy link
Contributor

I'm happy with the new text, although I note it's draft still.

PS. The rendering of the diff PDF is visible in Opera and MS Edge on Windows, but firefox shows the hex codes for unknown font symbol. I'm not sure what's going on there, but maintainers only need 1 working browser for validation and the final PDF is fine. :)

@github-actions
Copy link

github-actions bot commented May 4, 2023

Changed PDFs as of 1f92c08: SAMv1 (diff).

@jmarshall
Copy link
Member Author

I was less happy with the new text, I think because it said it was UTF-8 first and then adjusted it to mostly-ASCII second. I've turned that around again, and moved most of the text out of footnotes, so I think it's clearer and with a more accurate emphasis now.

Unicode adds several more line termination characters, so I added a footnote pointing out that only ASCII line terminators are valid.

And I've proposed a second commit using @jkbonfield's suggestion on #719 of using [:print:] in the regex. This might be a good idea, or it might be better to leave the regex as is and add a footnote saying “uh, for UTF-8 field values we actually mean [ -0xFFFE] or something”.

@jmarshall jmarshall marked this pull request as ready for review May 4, 2023 10:29
@jmarshall jmarshall changed the title [DRAFT] Clarify SAM file encoding (ASCII, UTF-8 "subset") Clarify SAM file encoding (mostly ASCII, some UTF-8 parts) May 4, 2023
@github-actions
Copy link

github-actions bot commented May 4, 2023

Changed PDFs as of 892fed3: SAMv1 (diff).

@jmarshall
Copy link
Member Author

The page breaking is now horrible, splitting the table between VN and SO. I have a followup PR that rearranges the first section a bit to bring both figures in the SAM example back to the first page, which restores the previous better page breaks in the header section.

@jkbonfield
Copy link
Contributor

And I've proposed a second commit using @jkbonfield's suggestion on #719 of using [:print:] in the regex. This might be a good idea, or it might be better to leave the regex as is and add a footnote saying “uh, for UTF-8 field values we actually mean [ -0xFFFE] or something”.

I'd be wary of attempting to specify unicode with a hex range. We could, but I'm not 100% sure what "printable" even means in unicode. Is the block replacement character (FFFD) printable? It's there for when decoding fails I think. Also why to FFFE? There are more beyond that. 𐘬 :-)

I just think it's a hiding to nothing and frankly [ -0xFFFE] does nothing to clarify the intention of the regular expression. There is no easy solution here though.

@jmarshall
Copy link
Member Author

jmarshall commented May 4, 2023

Indeed, that pseudo-range was mostly a joke — like the ‘uh’ and ‘or something’, I wasn't suggesting that we'd add it to the spec verbatim. I didn't bother looking up what the “last” (printable) Unicode character is. In the proposed commit, I've used your [:print:] suggestion along with a footnote.

@github-actions
Copy link

github-actions bot commented May 4, 2023

Changed PDFs as of 60a19a1: SAMv1 (diff).

@github-actions
Copy link

Changed PDFs as of 1011356: SAMv1 (diff).

@jmarshall
Copy link
Member Author

I've split the second commit adjusting the header regex off as a separate PR, #726.

This PR is now a fairly straightforward clarification to a single paragraph. I'd like to get it merged so that I can prepare the followup formatting improvement PR that affects the same sections. So let's start a one week timer for this one: if there are no objections before then, I will merge this on Friday May 19th.

@jkbonfield: Feel free to review and merge this before then if you wish.

@jkbonfield
Copy link
Contributor

Looks good to me, thanks. I'll merge it.

…ls#670)

Reword so that it is clear that the encodings specified apply to the
*entirety* of SAM file contents. Mention allowable line terminators,
and note that most UTF-8 line terminating characters are invalid
(as this whitespace is in the ASCII-only parts outwith UTF-8 fields).
Fixes samtools#664.

Spell out that the locale considerations are about e.g. requiring that
floating-point values use '.' rather than a localised decimal point.
@jkbonfield jkbonfield merged commit 7cfd789 into samtools:master May 15, 2023
1 check passed
@jkbonfield jkbonfield temporarily deployed to github-pages May 15, 2023 10:50 — with GitHub Pages Inactive
@github-actions
Copy link

Changed PDFs as of 7cfd789: SAMv1 (diff).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants