Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phred scale clarifications #579

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Jun 16, 2021

This fixes #534.

We abuse the term Phred-scale, particularly in VCF. This PR also has a minor wording change in SAM to be more explicit.

Sometimes VCF uses Phred as -10log_10(P(correct)) and sometimes -10log_10(P(incorrect)). Having two completely opposite direction scales with bigger-is-better vs smaller-is-better is really unhelpful. Plus sometimes it's not even probabilities at all, but likelihoods. Sometimes they're normalised, and sometimes not. This poor wording lead me up the garden path for a while and it wasn't until I checked source code that I understand what they meant, mainly due to preconceived (correct!) notions on what "Phred-scaled" means. Phred explicitly defined the scale as P(incorrect), so we now only use the term only when that is true.

I see this was already done correctly for the CNL, CNP and CNQ VCF tags, so I applied the same logic to the PL, PP and PQ tags.

Hopefully this also helps improve the understanding of GL (log10(p)) vs PL (-10log10(p)) by being more explicit instead of having formulae for defining one and a vague term for the other.

@hts-specs-bot
Copy link

Changed PDFs as of 7b37be1: SAMv1 (diff), VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of 7cafb52: SAMv1 (diff), VCFv4.3 (diff).

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good idea. Some small comments below.

SAMv1.tex Outdated Show resolved Hide resolved
VCFv4.3.tex Outdated
@@ -544,13 +544,14 @@ \subsection{VCF tag naming conventions}
\begin{itemize}
\item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL).
The likelihood can be also represented in some cases as a phred-true scale ($-10 \log_{10}(probability\_of\_being\_correct)$) in a separate tag (e.g.\ PL).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to introduce the term "phred-true scale"? It seems like a bit of a strange one off. It's not clear that it's contrasting probability of correct with probability of incorrect until you read the 'Q Suffix' box below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, but I'm unsure how best to describe it given the "P" prefix was almost certainly chosen for Phred.

Right now it states "Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers". The intent of the phred-ish scale appears to be to negate that and have a positive scale, so maybe this is the correct way of describing it. I'll have a stab at doing something more.

@jkbonfield
Copy link
Contributor Author

Rebased off current master, with the line-wrapping adjusted for SAMv1.tex.

The "phred-true-scale" sentence has been rewritten as left in as a separate commit so you can see the change, but should be squashed with the other VCF change when merging.

Note this doesn't change vcf 4.4 as I don't know what our policies are on versions to update. Which versions need to be changed? Do we have one commit per spec version so we can search git logs easier, or do we combine them together in one commit?

@hts-specs-bot
Copy link

Changed PDFs as of 571f96c: SAMv1 (diff), VCFv4.3 (diff).

@jkbonfield
Copy link
Contributor Author

Applied the changes to VCF4.4 draft and, minimally to VCF4.2. I also squashed the previous commit and this change into the main VCF doc updates. For review purposes the changes applied to VCF 4.3 (bar a minor grammar fix) since previous review are still visible as 571f96c

@hts-specs-bot
Copy link

Changed PDFs as of d3da996: SAMv1 (diff), VCFv4.2 (diff), VCFv4.3 (diff), VCFv4.4.draft (diff).

@hts-specs-bot
Copy link

Changed PDFs as of e7eddf0: SAMv1 (diff), VCFv4.2 (diff), VCFv4.3 (diff), VCFv4.4.draft (diff).

VCFv4.3.tex Outdated
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
PQ & 1 & Integer & Phasing quality \\
PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer \\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting nits: use \log to print a function in roman (and set off by thinspace from the -10) rather than badly spaced single-letter variables in italics; IMHO it looks better keeping the hyphen before -scaled (and it's typeset very differently from a minus sign so shouldn't be confusing).

There's already a bunch of 10 log 10 in VCF*.tex that are missing the \log, but this PR adds a bunch more instances so should probably use \log in its additions.

Suggested change
PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer \\
PL & G & Integer & $-10\log_{10}$-scaled genotype likelihoods rounded to the closest integer \\

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah just noticed the "-scaled" bit too... I'll update again. :-)

@jkbonfield
Copy link
Contributor Author

I fixed the log vs \log comment in two new commits, with the first best squashed. I don't know if it needs squashing down or up though! (Or maybe all 3 together?).

The first commit fixes the use of log in the changes I had made, but only those lines. Hence it could be squashed in with that.

The second commit fixes use the of log in all other lines, for the sake of consistency. This is purely a formatting commit so it's possibly worth keeping it separate from the wording changes.

@hts-specs-bot
Copy link

Changed PDFs as of 3d38893: SAMv1 (diff), VCFv4.2 (diff), VCFv4.3 (diff), VCFv4.4.draft (diff).

@jmarshall
Copy link
Member

I think two commits would be appropriate: (1) fix existing \log instances, not really part of this PR; (2) add the new text, already with \log throughout. That way around, it's always consistent.

(I once had a draft of fixing all the existing \log and standardising on using $ \Pr(foo|bar) $ for all the variously worded probabilities, but it appears I lost interest along the way… or maybe it got partly applied and then other log instances crept back in…)

Sometimes this refers to $10 log_{10}(p)$, sometimes to $10
log_{10}(1-p)$, and sometimes to something normalised so $p$ isn't
really a probability at all.

Note CNL, CNP and CNQ don't mention phred anywhere in their
short description and only Phred in the long description for CNQ, so
I applied the same logic to PL, PP (is this correct?) and PQ.

Also clarified the "VCF tag naming conventions" part.  I changed
phred-scale in one part there to phred-true-scale.  I'm not so happy
with that, but as it's immediately followed by the formula I think
it's clear.
@hts-specs-bot
Copy link

Changed PDFs as of 31feca3: SAMv1 (diff), VCFv4.2 (diff), VCFv4.3 (diff), VCFv4.4.draft (diff).

@d-cameron
Copy link
Contributor

At the next file format meeting we should decide on a policy on how far we backport clarifications and how we want to balance specifications stability, maintenance burden, and specifications clarity (same issue with #652).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: To do (backlog)
Development

Successfully merging this pull request may close these issues.

Phred scales vs qualities/scores
5 participants