
Commit

Allow for UTF-8 field values in header regular expression
Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes #719.
jmarshall committed May 12, 2023
1 parent 3c493e7 commit 229e998
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions SAMv1.tex
@@ -69,6 +69,7 @@ \section{The SAM Format Specification}

Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
+For brevity, named character classes are written as~{\tt [\cclass{class}]} without an additional pair of brackets.

\subsection{An example}\label{sec:example}
Suppose we have the following alignment with bases in lowercase
@@ -215,8 +216,10 @@ \subsection{The header section}
each data field follows a format `{\tt TAG:VALUE}' where {\tt TAG}
is a two-character string that defines the format and content of {\tt VALUE}.
Thus header lines match {\tt
-/\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[
--\char126]+)+\$/} or {\tt /\char94@CO\char92t.*/}.
+/\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[\cclass{print}]+)+\$/}
+or {\tt /\char94@CO\char92t.*/}.%
+\footnote{{\tt [\cclass{print}]} indicates that header field values contain printable characters, i.e.,~non-control characters.
+For fields limited to~ASCII, which is the majority, this is equivalent to~{\tt [ -\char126]}.}
Within each (non-{\tt @CO}) header line, no field tag may appear more than
once and the order in which the fields appear is not significant.
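
The effect of the revised regular expression can be illustrated with a minimal Python sketch. Python's `re` module does not support POSIX named classes, so this sketch assumes that `[[:print:]]` may be approximated as "any character outside Unicode category `Cc` (control)", following the footnote's reading of printable as non-control; a real POSIX locale may classify some characters differently. The names `ASCII_HEADER`, `has_no_control_chars`, and `is_valid_header_line` are hypothetical and do not come from the specification or from any SAM library.

```python
import re
import unicodedata

# ASCII-only form of the header pattern, with [[:print:]] written as [ -~].
ASCII_HEADER = re.compile(r"@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+")
COMMENT = re.compile(r"@CO\t.*")
FIELD = re.compile(r"([A-Za-z][A-Za-z0-9]):(.*)", re.DOTALL)


def has_no_control_chars(value):
    """Assumed stand-in for [[:print:]]+ : non-empty, no Unicode control chars."""
    return len(value) > 0 and all(
        unicodedata.category(ch) != "Cc" for ch in value
    )


def is_valid_header_line(line):
    """Check one header line (without its line terminator) against the pattern."""
    if COMMENT.match(line):
        return True
    fields = line.split("\t")
    if fields[0] not in ("@HD", "@SQ", "@RG", "@PG") or len(fields) < 2:
        return False
    for field in fields[1:]:
        m = FIELD.fullmatch(field)
        if m is None or not has_no_control_chars(m.group(2)):
            return False
    return True


if __name__ == "__main__":
    utf8_line = "@RG\tID:x\tDS:caf\u00e9"
    print(ASCII_HEADER.fullmatch(utf8_line) is not None)  # old ASCII-only pattern
    print(is_valid_header_line(utf8_line))                # revised pattern (approximated)
```

Splitting on tabs mirrors the structure of the regular expression: a tab is a control character, so it can never be part of a field value and each `TAG:VALUE` group corresponds to exactly one tab-separated field.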