Clarify the multiple incompatible uses of "phred-scale".
Sometimes this refers to $-10 \log_{10}(p)$, sometimes to $-10
\log_{10}(1-p)$, and sometimes to something normalised so $p$ isn't
really a probability at all.

Note CNL, CNP and CNQ don't mention phred anywhere in their
short description and only Phred in the long description for CNQ, so
I applied the same logic to PL, PP (is this correct?) and PQ.

Also clarified the "VCF tag naming conventions" part.  I changed
phred-scale in one part there to phred-true-scale.  I'm not so happy
with that, but as it's immediately followed by the formula I think
it's clear.
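
To make the distinction in this message concrete, here is a minimal Python sketch of the three usages. The function names (`neg10_log10`, `phred_quality`, `gl_to_pl`) are hypothetical and not part of the specification or of any library; the GL values are taken from the GL example in the VCFv4.2 text quoted in the diff, and normalising by subtracting the smallest value is an assumption based on the added wording "normalised so the most likely event has a score of 0".

```python
import math


def neg10_log10(p):
    """Apply the -10*log10 transform to an arbitrary probability p."""
    return -10.0 * math.log10(p)


def phred_quality(p_incorrect):
    """Phred in the strict sense: -10*log10(probability of being incorrect)."""
    return neg10_log10(p_incorrect)


def gl_to_pl(gls, normalise=True):
    """Convert log10-scaled likelihoods (GL) to rounded -10*log10 values (PL).

    With normalise=True the smallest value is subtracted first, so the most
    likely genotype scores 0 (an assumed reading of the spec wording).
    """
    raw = [-10.0 * gl for gl in gls]
    offset = min(raw) if normalise else 0.0
    return [round(x - offset) for x in raw]


# A 1-in-1000 chance of being wrong is Phred 30 ...
print(phred_quality(0.001))
# ... whereas the same transform applied to the chance of being right
# (0.999) gives a value close to zero.
print(neg10_log10(0.999))
# GL example from the VCFv4.2 text: GT:GL 0/1:-323.03,-99.29,-802.53
print(gl_to_pl([-323.03, -99.29, -802.53], normalise=False))
print(gl_to_pl([-323.03, -99.29, -802.53]))
```

After normalisation the values no longer correspond to any single probability, which is the third usage the message describes.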
jkbonfield committed Aug 24, 2021
1 parent 7137b57 commit d3da996
Showing 3 changed files with 19 additions and 15 deletions.
2 changes: 1 addition & 1 deletion VCFv4.2.tex
@@ -223,7 +223,7 @@ \subsubsection{Genotype fields}
\item FT : sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no whitespace or semicolons permitted)
\item GL : genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
\item GLE : genotype likelihoods of heterogeneous ploidy, used in presence of uncertain copy number. For example: GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53 (String)
-\item PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers)
+\item PL : the $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field (Integers).
\item GP : the phred-scaled genotype posterior probabilities (and otherwise defined precisely as the GL field); intended to store imputed genotype probabilities (Floats)
\item GQ : conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer)
\item HQ : haplotype qualities, two comma separated phred qualities (Integers)
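
The genotype ordering formula F(j/k) = (k*(k+1)/2)+j quoted in the GL context line above applies equally to PL, which is "otherwise defined in the same way". A small Python sketch, assuming diploid unphased genotypes and 0-based allele indices with j <= k; the helper names `genotype_index` and `genotype_order` are hypothetical:

```python
def genotype_index(j, k):
    """Position of diploid genotype j/k in the GL/PL value list."""
    if j > k:
        # The formula assumes j <= k; swap so 1/0 maps to the same slot as 0/1.
        j, k = k, j
    return k * (k + 1) // 2 + j


def genotype_order(alleles):
    """List genotype labels in GL/PL order for the given allele labels."""
    n = len(alleles)
    ordered = [None] * (n * (n + 1) // 2)
    for k in range(n):
        for j in range(k + 1):
            ordered[genotype_index(j, k)] = alleles[j] + alleles[k]
    return ordered


print(genotype_order("AB"))   # biallelic site
print(genotype_order("ABC"))  # triallelic site
```

The two printed lists reproduce the AA,AB,BB and AA,AB,BB,AC,BC,CC orderings given in the text.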
16 changes: 9 additions & 7 deletions VCFv4.3.tex
@@ -427,9 +427,9 @@ \subsubsection{Genotype fields}
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
MQ & 1 & Integer & RMS mapping quality \\
-PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
-PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
-PQ & 1 & Integer & Phasing quality \\
+PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer \\
+PP & G & Integer & $-10 log_{10}$ scaled genotype posterior probabilities rounded to the closest integer \\
+PQ & 1 & Integer & Phred-scaled phasing quality \\
PS & 1 & Integer & Phase set \\
\end{longtable}
@@ -515,8 +515,8 @@ \subsubsection{Genotype fields}
\item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
\item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
-\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
-\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
+\item PL (Integer): The $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
+\item PP (Integer): The $-10 log_{10}$ scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
\item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality.
\item PS (non-negative 32-bit Integer): Phase set, defined as a set of phased genotypes to which this genotype belongs.
@@ -544,13 +544,14 @@ \subsection{VCF tag naming conventions}
\begin{itemize}
\item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
-The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL).
+In some cases the likelihood may also be represented using a positive value in a separate tag (e.g.\ PL) using the $-10 \log_{10}(probability\_of\_being\_correct)$ scale.
+In this case they may be normalised so the most likely event has a score of 0.
\item The `P' suffix means \emph{probability} as linear-scale probability in the posterior distribution, which is $\Pr(\mathrm{Model}|\mathrm{Data})$. Examples are GP, CNP.
\item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$, where the model is the most likely genotype that appears in the GT field.
Examples are GQ, CNQ.
-The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
+The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number with $QUAL = -10 \log_{10}(probability\_of\_being\_incorrect)$).
\end{itemize}
@@ -2085,6 +2086,7 @@ \section{List of changes}
\subsection{Changes to VCFv4.3}
\begin{itemize}
+\item Clarify distinction between Phred ($-10 log_{10}(p\_of\_incorrect)$) and $-10 log_{10}(p\_of\_correct)$.
\item More strict language: ``should'' replaced with ``must'' where appropriate
\item Tables with Type and Number definitions for INFO and FORMAT reserved keys
16 changes: 9 additions & 7 deletions VCFv4.4.draft.tex
@@ -432,9 +432,9 @@ \subsubsection{Genotype fields}
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
MQ & 1 & Integer & RMS mapping quality \\
-PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
-PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
-PQ & 1 & Integer & Phasing quality \\
+PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer \\
+PP & G & Integer & $-10 log_{10}$ scaled genotype posterior probabilities rounded to the closest integer \\
+PQ & 1 & Integer & Phred-scaled phasing quality \\
PS & 1 & Integer & Phase set \\
\end{longtable}
@@ -520,8 +520,8 @@ \subsubsection{Genotype fields}
\item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
\item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
-\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
-\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
+\item PL (Integer): The $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
+\item PP (Integer): The $-10 log_{10}$ scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
\item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality.
\item PS (non-negative 32-bit Integer): Phase set, defined as a set of phased genotypes to which this genotype belongs.
@@ -549,13 +549,14 @@ \subsection{VCF tag naming conventions}
\begin{itemize}
\item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
-The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL).
+In some cases the likelihood may also be represented using a positive value in a separate tag (e.g.\ PL) using the $-10 \log_{10}(probability\_of\_being\_correct)$ scale.
+In this case they may be normalised so the most likely event has a score of 0.
\item The `P' suffix means \emph{probability} as linear-scale probability in the posterior distribution, which is $\Pr(\mathrm{Model}|\mathrm{Data})$. Examples are GP, CNP.
\item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$, where the model is the most likely genotype that appears in the GT field.
Examples are GQ, CNQ.
-The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
+The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number with $QUAL = -10 \log_{10}(probability\_of\_being\_incorrect)$).
\end{itemize}
@@ -2223,6 +2224,7 @@ \subsection{Changes between VCFv4.4 and VCFv4.3}
\subsection{Changes to VCFv4.3}
\begin{itemize}
+\item Clarify distinction between Phred ($-10 log_{10}(p\_of\_incorrect)$) and $-10 log_{10}(p\_of\_correct)$.
\item More strict language: ``should'' replaced with ``must'' where appropriate
\item Tables with Type and Number definitions for INFO and FORMAT reserved keys