Clarify the multiple incompatible uses of "phred-scale".
Sometimes this refers to $-10 \log_{10}(p)$, sometimes to
$-10 \log_{10}(1-p)$, and sometimes to something normalised so $p$
isn't really a probability at all.

Note that CNL, CNP and CNQ don't mention phred anywhere in their
short descriptions, and only the long description for CNQ mentions
Phred, so I applied the same logic to PL, PP (is this correct?) and PQ.

Also clarified the "VCF tag naming conventions" part.  I changed
phred-scale in one part there to phred-true-scale.  I'm not so happy
with that, but as it's immediately followed by the formula I think
it's clear.
jkbonfield committed Jun 16, 2021
1 parent c236c44 commit 7b37be1
Showing 1 changed file with 9 additions and 7 deletions.
16 changes: 9 additions & 7 deletions VCFv4.3.tex
@@ -427,9 +427,9 @@ \subsubsection{Genotype fields}
 GT & 1 & String & Genotype \\
 HQ & 2 & Integer & Haplotype quality \\
 MQ & 1 & Integer & RMS mapping quality \\
-PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
-PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
-PQ & 1 & Integer & Phasing quality \\
+PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer\\
+PP & G & Integer & $-10 log_{10}$ scaled genotype posterior probabilities rounded to the closest integer\\
+PQ & 1 & Integer & Phred-scaled phasing quality\\
 PS & 1 & Integer & Phase set \\
 \end{longtable}
@@ -515,8 +515,8 @@ \subsubsection{Genotype fields}
 \item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
 \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
-\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
-\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
+\item PL (Integer): The $log_{10}$ scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
+\item PP (Integer): The $log_{10}$ scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
 \item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
 We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality.
 \item PS (non-negative 32-bit Integer): Phase set, defined as a set of phased genotypes to which this genotype belongs.
@@ -544,13 +544,14 @@ \subsection{VCF tag naming conventions}
 \begin{itemize}
 \item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
 Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
-The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL).
+The likelihood can be also represented in some cases as a phred-true scale ($-10 \log_{10}(probability\_of\_being\_correct)$) in a separate tag (e.g.\ PL).
+In this case they may be normalised so the most likely event has a score of 0.
 \item The `P' suffix means \emph{probability} as linear-scale probability in the posterior distribution, which is $\Pr(\mathrm{Model}|\mathrm{Data})$. Examples are GP, CNP.
 \item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$, where the model is the most likely genotype that appears in the GT field.
 Examples are GQ, CNQ.
-The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
+The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number with $QUAL = -10 \log_{10}(probability\_of\_being\_incorrect)$).
 \end{itemize}
@@ -2085,6 +2086,7 @@ \section{List of changes}
 \subsection{Changes to VCFv4.3}
 \begin{itemize}
+\item Clarify distinction between Phred ($-10 log_{10}(p\_of\_incorrect)$) and $-10 log_{10}(p\_of\_correct)$.
 \item More strict language: ``should'' replaced with ``must'' where appropriate
 \item Tables with Type and Number definitions for INFO and FORMAT reserved keys
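
The distinction this commit draws can be illustrated numerically. The following is a minimal Python sketch, not code from the specification or from any VCF library: the function names are hypothetical, and the normalisation step assumes the convention stated in the diff (the most likely event gets a score of 0), applied before rounding.

```python
import math


def phred_from_error(p_incorrect):
    """Phred score in the QUAL/GQ sense: -10 * log10(P(incorrect))."""
    return -10.0 * math.log10(p_incorrect)


def neg10log10_correct(p_correct):
    """The other usage: -10 * log10(P(correct)); near 0 for confident calls."""
    return -10.0 * math.log10(p_correct)


def pl_from_gl(gl):
    """Convert log10 genotype likelihoods (GL) to PL-style integers.

    Assumption: values are shifted so the most likely genotype scores 0,
    then rounded to the closest integer.
    """
    raw = [-10.0 * g for g in gl]      # -10 * log10(likelihood)
    best = min(raw)                    # most likely genotype has the smallest value
    return [round(x - best) for x in raw]


if __name__ == "__main__":
    # The same call (99.9% likely correct) under the two conventions.
    print(phred_from_error(0.001))     # large value: high quality
    print(neg10log10_correct(0.999))   # small value: same call, other scale
    print(pl_from_gl([-0.1, -2.0, -5.3]))
```

The two functions apply the same formula to complementary probabilities, which is why a single "phred-scaled" label is ambiguous: a high-confidence call scores high on one scale and near zero on the other.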