Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that @RG PI values are integers #721

Merged
merged 1 commit into from
May 23, 2023
Merged

Conversation

jmarshall
Copy link
Member

From the field's description as an insert size, the value is clearly numeric. Further, HTSJDK has required it to be an integer since 2010 (see samtools/htsjdk@2799b1f, 2010-02-23, "Add getter and setter for predicted median insert size to SAMReadGroupRecord").

So the proposal is to clarify that PI values should be integers, as enforced by one of the widely used implementations. Furthermore as insert sizes, the values are naturally non-negative. Fixes #720.

Also adjust test files that had PI:<floating-point value>.

@github-actions
Copy link

Changed PDFs as of 930ada1: SAMv1 (diff).

SAMv1.tex Outdated
@@ -315,7 +315,7 @@ \subsection{The header section}
& {\tt KS} & The array of nucleotide bases that correspond to the key sequence of each read.\\\cline{2-3}
& {\tt LB} & Library.\\\cline{2-3}
& {\tt PG} & Programs used for processing the read group.\\\cline{2-3}
& {\tt PI} & Predicted median insert size.\\\cline{2-3}
& {\tt PI} & Predicted median insert size. \emph{Valid values}: Integers matching \verb"/^[0-9]+$/"\\\cline{2-3}
Copy link
Contributor

@jkbonfield jkbonfield May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ^(0|[1-9][0-9]*)$? I'm concerned about questions over 0123 should be interpreted as octal. I think htsjdk would do so already, as would any C tool using strtol or atoi.

I feel we're best off just rejecting such data.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description should also be updated, e.g., to rounded predicted median insert size. Otherwise, further text is owed for as to why it's an integer rather than a decimal value.

@jmarshall jmarshall force-pushed the rg-pi branch 3 times, most recently from 302de8f to f5facde Compare May 23, 2023 10:18
@samtools samtools deleted a comment from github-actions bot May 23, 2023
@github-actions
Copy link

Changed PDFs as of f5facde: SAMv1 (diff).

@jmarshall
Copy link
Member Author

The rest of the spec is agnostic re decimal versus octal versus etc: e.g., the table in §1.4 just says “Int” (and gives ranges) for FLAG, POS, MAPQ, PNEXT, and TLEN and offers no further hints as to how these Ints are to be represented in text. I don't propose to address this in this PR (or anytime soon), so I've reworded it to just say it's an integer without any regex.

I'm not sure I agree that the spec is obliged to say why this is an integer, but nonetheless I've replaced the previous text added with just “[…], rounded to the nearest integer”.

@jkbonfield
Copy link
Contributor

Looks good to me.

From the field's description as an insert size, the value is clearly
numeric (and non-negative). Further, HTSJDK has required it to be an
integer since 2010 (see HTSJDK 2799b1f30b9d53a2949861a31bb85bfda08a4023,
2010-02-23, "Add getter and setter for predicted median insert size to
SAMReadGroupRecord"). Fixes samtools#720.

Adjust test files that had `PI:<floating-point value>`.
@jmarshall jmarshall merged commit 8b8fe6d into samtools:master May 23, 2023
1 check failed
@jmarshall jmarshall deleted the rg-pi branch May 23, 2023 19:42
@jmarshall jmarshall temporarily deployed to github-pages May 23, 2023 19:42 — with GitHub Pages Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sam: Define format of read group predicted median insert size
3 participants