samtools · jkbonfield · Jun 16, 2021 · Jun 16, 2021 · Sep 16, 2021
diff --git a/SAMv1.tex b/SAMv1.tex
@@ -167,8 +167,9 @@ \subsection{Terminologies and Concepts}
   between the 3rd and the 7th bases inclusive is $[2,7)$. The BAM, BCFv2, BED,
   and PSL formats are using the 0-based coordinate system.
 
-\item[Phred scale] Given a probability $0<p\le 1$, the phred scale of $p$
-  equals $-10\log_{10}p$, rounded to the closest integer.
+\item[Phred scale] Given a probability $0<p\le 1$ of an erroneous
+  call, the phred scale of $p$ equals $-10\log_{10}p$, rounded to the
+  closest integer.
 
 \end{description}
 

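A minimal sketch of the Phred scale as worded in the SAMv1 hunk above, where $p$ is the probability of an erroneous call. The function names are hypothetical and are not part of the specification or of any samtools API. The specification does not say how exact ties are rounded, so the tie behaviour of Python's `round` is an assumption here.

```python
import math


def phred_from_error_prob(p: float) -> int:
    """Return -10*log10(p), rounded to the closest integer, for 0 < p <= 1.

    p is the probability that the call is wrong.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    # Assumption: Python's round() resolves exact .5 ties to the even integer;
    # the specification text only says "rounded to the closest integer".
    return round(-10.0 * math.log10(p))


def error_prob_from_phred(q: float) -> float:
    """Invert the Phred scale: the error probability implied by a score q."""
    return 10.0 ** (-q / 10.0)
```

For instance, an error probability of 0.001 maps to a Phred score of 30, and a score of 20 maps back to an error probability of 0.01.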
diff --git a/VCFv4.2.tex b/VCFv4.2.tex
@@ -181,7 +181,7 @@ \subsubsection{Fixed fields}
   \item ID - identifier: Semicolon-separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no whitespace or semicolons permitted)
   \item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g.\ complex substitutions or other events where all alleles have at least one base represented in their Strings.  If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required).
   \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles.  These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a upstream deletion. If there are no alternative alleles, then the missing value should be used.  Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive.  (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
-  \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). If unknown, the missing value should be specified. (Numeric)
+  \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10\log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10\log_{10}$ prob(variant), and if ALT is not `.' this is $-10\log_{10}$ prob(no variant). If unknown, the missing value should be specified. (Numeric)
   \item FILTER - filter status: PASS if this position has passed all filters, i.e., a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.\ ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no whitespace or semicolons permitted)
   \item INFO - additional information: (String, no whitespace, semicolons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: $<$key$>$=$<$data$>$[,data]. If no keys are present, the missing value must be used. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):
 \begin{itemize}
@@ -221,11 +221,11 @@ \subsubsection{Genotype fields}
 	\end{itemize}
   \item DP : read depth at this position for this sample (Integer)
   \item FT : sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no whitespace or semicolons permitted)
-  \item GL : genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.  In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc.  For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
+  \item GL : genotype likelihoods comprised of comma separated floating point $\log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.  In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc.  For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
   \item GLE : genotype likelihoods of heterogeneous ploidy, used in presence of uncertain copy number. For example: GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53 (String)
-  \item PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers)
+  \item PL : the $-10 \log_{10}$-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field (Integers)
   \item GP : the phred-scaled genotype posterior probabilities (and otherwise defined precisely as the GL field); intended to store imputed genotype probabilities (Floats)
-  \item GQ : conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer)
+  \item GQ : conditional genotype quality, encoded as a phred quality $-10\log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer)
   \item HQ : haplotype qualities, two comma separated phred qualities (Integers)
   \item PS : phase set.  A phase set is defined as a set of phased genotypes to which this genotype belongs.  Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set.  A phase set specifies multi-marker haplotypes for the phased genotypes in the set.  All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.  If the genotype in the GT field is unphased, the corresponding PS field is ignored.  The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)
   \item PQ : phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).  We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. (Integer)
@@ -311,7 +311,7 @@ \section{FORMAT keys used for structural variants}
 ##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype">
 \end{verbatim}
 \normalsize
-These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality $-10log_{10}$ p(copy number genotype call is wrong). CNL specifies a list of $log_{10}$ likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys.
+These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality $-10\log_{10}$ p(copy number genotype call is wrong). CNL specifies a list of $\log_{10}$ likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys.
 
 \section{Representing variation in VCF records}
 \subsection{Creating VCF entries for SNPs and small indels}

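The genotype ordering quoted in the GL item above, F(j/k) = (k*(k+1)/2)+j, can be sketched as follows for the diploid case. This is an illustration only: the function names are hypothetical, alleles are indexed 0 for REF and 1, 2, ... for the ALT alleles in order, and j <= k is assumed, as in the unphased genotype j/k.

```python
def genotype_index(j: int, k: int) -> int:
    """Index of diploid genotype j/k in the GL/PL list: F(j/k) = k*(k+1)/2 + j."""
    if j < 0 or k < 0:
        raise ValueError("allele indices must be non-negative")
    if j > k:
        # Assumption: treat k/j the same as j/k, since the ordering is unphased.
        j, k = k, j
    return k * (k + 1) // 2 + j


def diploid_genotype_order(alleles):
    """List all diploid genotypes in the order implied by genotype_index."""
    n = len(alleles)
    # Iterating k in the outer loop and j <= k in the inner loop yields
    # genotypes with increasing F(j/k).
    return [alleles[j] + alleles[k] for k in range(n) for j in range(k + 1)]
```

With alleles A (REF), B and C (ALT), `diploid_genotype_order(["A", "B", "C"])` reproduces the triallelic ordering AA, AB, BB, AC, BC, CC stated in the text.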
diff --git a/VCFv4.3.tex b/VCFv4.3.tex
@@ -322,8 +322,8 @@ \subsubsection{Fixed fields}
   In other words, the ALT field must be a symbolic allele, or a breakend replacement string, or match the regular expression \texttt{\^{}([ACGTNacgtn]+|\string\*|\string\.)\$}.
   Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive.
   (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
-  \item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10log_{10}$ prob(call in ALT is wrong).
-  If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant).
+  \item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10\log_{10}$ prob(call in ALT is wrong).
+  If ALT is `.' (no variant) then this is $-10\log_{10}$ prob(variant), and if ALT is not `.' this is $-10\log_{10}$ prob(no variant).
   If unknown, the MISSING value must be specified. (Float)
   \item FILTER --- filter status: PASS if this position has passed all filters, i.e., a call is made at this position.
   Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.\ ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples.
@@ -427,9 +427,9 @@ \subsubsection{Genotype fields}
       GT		& 1		& String	& Genotype \\
       HQ		& 2		& Integer	& Haplotype quality \\
       MQ		& 1		& Integer	& RMS mapping quality \\
-      PL		& G		& Integer	& Phred-scaled genotype likelihoods rounded to the closest integer \\
-      PP		& G		& Integer	& Phred-scaled genotype posterior probabilities rounded to the closest integer \\
-      PQ		& 1		& Integer	& Phasing quality \\
+      PL		& G		& Integer	& $-10\log_{10}$-scaled genotype likelihoods rounded to the closest integer \\
+      PP		& G		& Integer	& $-10\log_{10}$-scaled genotype posterior probabilities rounded to the closest integer \\
+      PQ		& 1		& Integer	& Phred-scaled phasing quality \\
       PS		& 1		& Integer	& Phase set \\
 \end{longtable}
 
@@ -443,7 +443,7 @@ \subsubsection{Genotype fields}
   Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied.
   These values should be described in the meta-information in the same way as FILTERs.
   No whitespace or semicolons permitted.
-  \item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
+  \item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10\log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
   \item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
   \item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$.
   The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.
@@ -457,7 +457,7 @@ \subsubsection{Genotype fields}
 	  \item $\mid$ : genotype phased
 	\end{itemize}
 
-  \item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
+  \item GL (Float): Genotype likelihoods comprised of comma separated floating point $\log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
   In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed.
 
   \textsc{Genotype Ordering.} \label{genotype-fields:genotype-ordering}
@@ -515,8 +515,8 @@ \subsubsection{Genotype fields}
 
   \item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
   \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
-  \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
-  \item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
+  \item PL (Integer): The $-10 \log_{10}$-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
+  \item PP (Integer): The $-10 \log_{10}$-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
   \item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
   We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality.
   \item PS (non-negative 32-bit Integer): Phase set, defined as a set of phased genotypes to which this genotype belongs.
@@ -544,13 +544,14 @@ \subsection{VCF tag naming conventions}
 \begin{itemize}
     \item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
     Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
-    The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL).
+    In some cases the likelihood may also be represented as a positive value in a separate tag (e.g.\ PL) on the $-10 \log_{10}(probability\_of\_being\_correct)$ scale.
+    In this case the values may also be normalised so that the most likely event has a score of 0.
 
     \item The `P' suffix means \emph{probability} as linear-scale probability in the posterior distribution, which is $\Pr(\mathrm{Model}|\mathrm{Data})$. Examples are GP, CNP.
 
     \item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$, where the model is the most likely genotype that appears in the GT field.
     Examples are GQ, CNQ.
-    The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
+    The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number with $QUAL = -10 \log_{10}(probability\_of\_being\_incorrect)$).
 \end{itemize}
 
 
@@ -640,8 +641,8 @@ \section{FORMAT keys used for structural variants}
 \normalsize
 These keys are analogous to GT/GQ/GL/GP and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined).
 CN specifies the integer copy number of the variant in this sample.
-CNQ is encoded as a phred quality $-10log_{10}$ p(copy number genotype call is wrong).
-CNL specifies a list of $log_{10}$ likelihoods for each potential copy number, starting from zero.
+CNQ is encoded as a phred quality $-10\log_{10}$ p(copy number genotype call is wrong).
+CNL specifies a list of $\log_{10}$ likelihoods for each potential copy number, starting from zero.
 CNP is 0 to 1-scaled copy number posterior probabilities (and otherwise defined precisely as the CNL field), intended to store imputed genotype probabilities.
 When possible, GT/GQ/GL/GP should be used instead of (or in addition to) these keys.
 
@@ -2085,6 +2086,7 @@ \section{List of changes}
 \subsection{Changes to VCFv4.3}
 
 \begin{itemize}
+\item Clarify the distinction between Phred-scaled values ($-10 \log_{10}(p\_of\_incorrect)$) and $-10 \log_{10}(p\_of\_correct)$-scaled values.
 \item More strict language: ``should'' replaced with ``must'' where appropriate
 \item Tables with Type and Number definitions for INFO and FORMAT reserved keys
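The relationship between GL and PL described in the VCFv4.3 hunks above can be sketched as below: PL is the $-10\log_{10}$-scaled likelihood rounded to the closest integer, optionally normalised so that the most likely genotype scores 0. The function name and the `normalise` flag are hypothetical, and subtracting the minimum before rounding (rather than after) is an assumption that the specification text does not settle.

```python
def pl_from_gl(gl, normalise=True):
    """Convert log10-scaled genotype likelihoods (GL) to integer PL values."""
    if not gl:
        raise ValueError("gl must contain at least one likelihood")
    # GL values are log10 likelihoods (negative numbers), so -10 * GL is positive.
    scaled = [-10.0 * g for g in gl]
    if normalise:
        # Assumption: shift before rounding so the most likely genotype gets 0.
        best = min(scaled)
        scaled = [s - best for s in scaled]
    return [round(s) for s in scaled]
```

Using the GL example from the specification text, `pl_from_gl([-323.03, -99.29, -802.53])` gives `[2237, 0, 7032]`, while passing `normalise=False` gives `[3230, 993, 8025]`.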