sam-tags: MM field value pattern does not allow ambiguity codes #732

Open
zaeleus opened this issue Jun 27, 2023 · 3 comments
@zaeleus
zaeleus commented Jun 27, 2023

This is in regard to Sequence Alignment/Map Optional Fields Specification (2022-08-17).

The base modifications (MM) field allows modifications to be either short codes or a ChEBI ID. Short codes are constrained to [a-z]+ (i.e., lowercase letters), but the table of "standard common types" lists ambiguity codes that do not match this (i.e., uppercase letters).

| Unmodified base | Code | Abbreviation | Name | ChEBI |
|---|---|---|---|---|
| C | C | | Ambiguity code; any C mod | |
| T | T | | Ambiguity code; any T mod | |
| U | U | | Ambiguity code; any U mod | |
| A | A | | Ambiguity code; any A mod | |
| G | G | | Ambiguity code; any G mod | |
| N | N | | Ambiguity code; any mod | |

@jkbonfield
Contributor

The short codes are modified bases, so "m" and "h" are 5mC and 5hmC. It doesn't make any sense to have a base modification from nucleotide to ambiguity code, so I'm not sure I follow this.

We don't support ambiguity codes in the unmodified base component, so we couldn't do MM:Z:Y+h,4; for example, as it wouldn't make sense. "N" covers this case anyway with the different counting regime.

@zaeleus
Author

zaeleus commented Jun 28, 2023

I'm referring to the Code column of the standard common types table under the MM description. It defines codes that are uppercased, but the MM field pattern does not allow it: MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*. I referred to the short code portion as [a-z]+ originally.

The description for ML gives an example of using an ambiguous modification:

"For example MM:Z:C+C,10; ML:B:C,229 indicates a C call with a probability of 90% of having some form of unspecified modification."

See that it uses C as the modification code, which does not match ([a-z]+|[0-9]+).
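
The mismatch can be reproduced with a quick check. This is only a sketch: it assumes Python's `re` syntax is close enough to the spec's regex notation, drops the `MM:Z:` prefix so only the field value is tested, and the helper name `matches_spec_pattern` is made up for illustration.

```python
import re

# Value pattern as quoted above from the spec, without the "MM:Z:" prefix.
# Assumption: Python's `re` interprets it the same way the spec intends.
MM_VALUE = re.compile(r"([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*")


def matches_spec_pattern(value: str) -> bool:
    """Return True if the whole MM value matches the pattern as published."""
    return MM_VALUE.fullmatch(value) is not None


# A lowercase short code is accepted.
print(matches_spec_pattern("C+m,5,12;"))  # True
# The ambiguity-code example from the ML description is rejected.
print(matches_spec_pattern("C+C,10;"))    # False
```

`fullmatch` is used rather than `match` because the outer `(...)*` can match the empty string, so a prefix match would succeed on any input.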

@jkbonfield
Contributor

jkbonfield commented Jun 28, 2023

Oh wow I'd totally forgotten about that!

Yes, the regexp should be ([a-zACGTUN]+|[0-9]+) for the code portion. Good spot. Thanks :)
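
For anyone validating MM values in the meantime, here is a minimal sketch of the full value pattern with the amended code portion substituted in. It assumes Python's `re` syntax and that the rest of the pattern stays as currently published; `matches_proposed_pattern` is a hypothetical helper name, not part of any library.

```python
import re

# Full MM value pattern (without the "MM:Z:" prefix) using the amended
# code portion ([a-zACGTUN]+|[0-9]+). Everything else is unchanged.
MM_VALUE_PROPOSED = re.compile(
    r"([ACGTUN][-+]([a-zACGTUN]+|[0-9]+)[.?]?(,[0-9]+)*;)*"
)


def matches_proposed_pattern(value: str) -> bool:
    """Return True if the whole MM value matches the amended pattern."""
    return MM_VALUE_PROPOSED.fullmatch(value) is not None


# Ambiguity codes are now accepted as modification codes.
print(matches_proposed_pattern("C+C,10;"))  # True
# Lowercase short codes still work.
print(matches_proposed_pattern("C+m,5;"))   # True
# Uppercase letters outside ACGTUN are still rejected.
print(matches_proposed_pattern("C+Y,4;"))   # False
```

Note that the amended character class also admits mixed strings such as `mC`; whether that should be allowed is a separate question this sketch does not try to answer.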
