Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification on absence of ? or . in MM tags #654

Closed
jrobinso opened this issue Jun 16, 2022 · 15 comments
Closed

Clarification on absence of ? or . in MM tags #654

jrobinso opened this issue Jun 16, 2022 · 15 comments

Comments

@jrobinso
Copy link

This issue arose implementing support for MM in IGV. There is an apparent discrepancy between the spec and current practice when interpreting MM tags which do not contain a "?" or "." flag. The spec says that when this flag is not present it should be assumed the modifications at skipped bases are present with low probability (i.e. equivalent to "."). However in files produced to date without this flag the opposite is true, i.e. nothing is known about skipped bases. Consequently this is how IGV interprets skipped bases.

So the request would be to change the spec to conform to current practice, or maybe more specifically existing files from past practice since the major vendors are updating their tools to always include either ? or .

@amwenger @marcus1487 jump in if I've misunderstood.

@marcus1487
Copy link

I support this clarification in the specification. This should only effect legacy versions of the tag, as I would hope current/new tags would include the ? or . specifier. And at Nanopore we have just recently started to output all-context 5mC calls. So all old Nanopore files will default to the ? tag version since calls are not made at any non-CG contexts.

@amwenger
Copy link

@jrobinso, your explanation is correct for PacBio's methylation caller, and I would support modifying the spec so that ? mode is default.

The currently released version does not specify a mode but uses ? in practice, only providing calls at CpG sites. The next release will be be explicit about ? mode.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 16, 2022

I'm not convinced in changing the spec as using ? mode means explicitly specifying the non-called bases as having confidence 0, which doesn't happen in all data as far as I can see.

At the time of writing the spec I think I had a grand total of one real called read (it was somewhat chicken and egg), but that was called end to end and non-methylated bases were simply absent.

Looking at one of the PacBio called data sets, I do indeed see it's not specifying . or ?, but it appears to be a hybrid between the two. It has long regions of no call, and then calls bases within a region without explicit mode. Eg:

Mm:Z:C+m,2932,0,5,2,2,2,6,6,3,3,3,0,...
Ml:B:C,254,255,223,247,255,255,187,232,175,7...

So the 2932 skip is almost certainly skipping an initial region (maybe related to 4533S in the alignment? I haven't counted the Cs but it seems far too many), and then it's reporting methylation status for bases that it thinks have been methylated only. That bit at least is adhering to the spec as written. I don't see many 0 confidence values in the Ml tag. So I think the problem here is the partial calling over a specific range.

If we start changing to spec to fit the incorrect software then we'll probably then get a complaint from another vendor saying that by changing the defaults we've altered the meaning of their data which doesn't write out those runs of 0s.

Edit: if we change this to explicit ? mode then we're saying the inbetween bases haven't even been investigated and have unknown methylation status. This doesn't seem likely to be true. 87% of the bases called have probability >= 0.2, and 77% have P >= 0.5. Unless virtually every C in this data is methylated, it's very very probable that it's simply not recording the bases it thinks have vanishingly low probability of being methylated. Additionally I see in the initial Mm tags (for the first 121 reads I checked) some 126K Cs skipped by the first Mm number and some ~37K with recorded probabilities. (There's no large terminating number, so it's not explicitly specifying a count to skip there.) I see 627K Cs in the reads, so even accounting for the initial skip and assuimng a comparable unrecording terminating skip, there are a huge number unaccounted for. I'm pretty certain it's not using an explicit mode of "if not recorded, then it's not been checked".

@jrobinso
Copy link
Author

jrobinso commented Jun 16, 2022

I was just pointing out that there is no known software or existing data files that produces files with the "." default assumption, on the contrary all known software and files in existence use the "?" default assumption.

Its possible I have misread the spec, but I did I confirm my reading with @amwenger and @marcus1487 . Under the "." assumption I would assume all skipped "Cs" (e.g. the first 2932 in your example) would be known to have the modification with low probability. But you say that example was written to the spec. So perhaps some clarification to the spec is in order, rather than a change. @amwenger could you comment on that example?

if we change this to explicit ? mode then we're saying the inbetween bases haven't even been investigated and have unknown methylation status

Yes! That is exactly what we are saying, that is how IGV is currently interpreting the pacbio and ont data. Only "CG" sites have been investigated. Within CG site some alignments could not be called, for various reasons, and nothing can be said about them. They are also skipped.

Edit:

I'm not convinced in changing the spec as using ? mode means explicitly specifying the non-called bases as having confidence 0, which doesn't happen in all data as far as I can see.

Actually that's what "." mode means from my reading. "?" mode means nothing can be said one way or another about non-called (i.e. skipped) bases. I don't read ML as confidence, but as probability, and an ML of zero means we are confident the modification is not present. I interpret a non-call (skipped base in "?" mode) to mean we haven't checked or don't have enough confidence to make a call. Bases that are called with low probability get a bright blue color in IGV to mean these bases are probably not methylated. Non-call bases don't get any color at all.

I'm pretty certain it's not using an explicit mode of "if not recorded, then it's not been checked".

This is the nub of the misunderstanding or issue, I am assuming exactly that, i.e. nothing can be said about those bases. BTW "been checked" is not the same as no-call. I don't think whether or not its been checked is relevant.

@jkbonfield
Copy link
Contributor

Ok if it's calling only CG motifs and not just C bases then that would explain why it has lots of gaps. Is it to be expected though that almost all CGs are infact methylated? I thought generally it wasn't so common, but I'm a novice really at the biology here.

I'm not sure we can say "there is no known software or existing data files that produces files with the "." default assumption". The first data I had came from the ONT basecaller which was spitting out a mod_mapped.bam file. I'm pretty syre that was matching the spec, but I could be wrong and maybe it's also looking only in CG cases?

@jrobinso
Copy link
Author

@marcus1487 can comment on this but all the data I have seen so far have been CG context callers. So nothing can be said about other "Cs" ("Cs" not in CG contexts). That's what prompted this issue, if I interpreted these files to the spec all of those Cs would get colors indicating they are not methylated which makes no sense.

Methylation is very common but just because a call is made does not mean that the site is methylated. It could equally mean it is believed to be NOT methylated (ML < 50%). It just means that something is definitely being said about it, as opposed to nothing is known. A call with ML of 0 is information, a skipped base is not, and I think that is where the confusion is. This is a confusing topic, I struggled with it, but that seems to be how CG context calls are to be interpreted. I don't have any experience with other modifications.

@jrobinso
Copy link
Author

BTW the non-context specific callers, of which I have not seen any data, will use the "." explicitly to save disk space since otherwise there would be a huge number of near zero calls. To date I think almost all the release pipelines and data, at least that I have seen, are CG specific 5mc callers.

@jkbonfield
Copy link
Contributor

That was the whole intention when designing the format - to avoid doubling up on space by storing an extra per-base methylation channel, potentially with many new calls in the future. Not being sufficiently familier with the biological processes though I didn't think that algorithms wouldn't even look for signal outside of CpG islands, but it makes sense given the bulk of methylation occurs there in mammals and it's an easy way to improve "accuracy" (in a somewhat biased fashion).

I'm stil wary of just changing the definition this late though. Given how hard it was to reach the correct bits of the community during the lengthy review (basically I failed to do so despite trying), I don't really have much confidence we can also reach enough of it again now to really know who is doing what.

Maybe someone closer to the problem can enumerate all the platforms that are capable of producing such data and describe the state of their pipelines, along with who to contact? We're not going to reach the necessary audience by posing questions here (otherwise we wouldn't be in this situation to start with).

@jrobinso
Copy link
Author

jrobinso commented Jun 16, 2022

I understand your reluctance to change the spec. OTOH I can't change IGV to conform to the spec as it then wouldn't work with current data. And AFAIK @amwenger and @marcus1487 represent the only platforms currently capable of producing this data.

The only time "." makes sense, as far as I understand, is for context free callers. I don't think there are any of these released yet, some are in development. For context specific callers (e.g. CpG) "?" is the only reasonable interpretation as there are many more "Cs" (for example) out of context than in context. In other words the ratio of skipped to called bases is high. Its not reasonable to treat these skipped bases as having been investigated.

None of this matters unless you are distinguishing a no-call from a call at low probability. This is subtle but important for methylation at least. A call with low probability == a call of no methylation with high probability.

This is not a huge problem as both platforms will be specifying ? or . going forward, my concern is someone someday will produce files that do conform to the spec in this regard in which case IGV (and possibly other tools) won't work properly. But that can be dealt with if it occurs. The best practice would be to always specify this and not rely on defaults. I just raised this issue as it was clear that spec and practice were out of sync, feel free to close it.

@marcus1487
Copy link

I can confirm that all nanopore production 5mC results should use the ? default as they are CG-context models. We have a first all-context 5mC research model released, but this is not a production release at this time. Given that many nanopore calls were made before the update to the spec to include the tag specifier I would agree that an update to the format to use ? as the default would work bets for all nanopore data produced.

Is it to be expected though that almost all CGs are infact methylated? I thought generally it wasn't so common, but I'm a novice really at the biology here.

Native human (and most native mammalian samples) have about 50% of Cs in CG contexts modified and ~1-5% of Cs in non-CG contexts modified. Plants have more CHG methylation (I don't know enough about plant CG context methylation).

The first data I had came from the ONT basecaller which was spitting out a mod_mapped.bam file. I'm pretty syre that was matching the spec, but I could be wrong and maybe it's also looking only in CG cases?

Example data provided from nanopore at the time of development should have been CG-context only.

From the nanopore side I would recommend @jts and @cjw85 might want to comment.

@cjw85
Copy link

cjw85 commented Jun 17, 2022

A clarification on results from Oxford Nanopore Sequencing basecallers.

The previous generation of calling algorithm was capable of performing all-context calling. There are files in the wild where the tags follow the specification with no optional . or ?, to the letter; that is to say they have "missing" implied entries from the tag which can safely and confidently be interpreted as low modification probability, not just that no call was made. (#582 (comment))

There are clearly some BAMs in the wild that do not use the optional tags but are encoding their data as if they were using the explicit ? mode. I don't think the specification is perfect (and is now suffering a historical wart of the implicit/explicit mode flag being optional), but neither do I think we should rush to change it.

Similar to @jkbonfield I don't hold with merit the argument that the specification should be changed to fit erroneous files that have been produced thus far.

@jrobinso
Copy link
Author

jrobinso commented Jun 17, 2022

There are files in the wild where the tags follow the specification with no optional . or ?, to the letter; that is to say they have "missing" implied entries from the tag which can safely and confidently be interpreted as low modification probability, not just that no call was made.

This is news to me, and changes the whole discussion. In the unlikely event a user loads such a file in IGV they will get unexpected results, but I can deal with it there. As I raised this issue I will close it.

@jkbonfield
Copy link
Contributor

So this leaves us in a pickle, with both defaults being used at some point. Blergh!

I think what this really means is several things:

  1. The spec should recommend to explicitly state which model is in use; implicit vs explicit, but the default can be left as-is.
  2. Tools like IGV may need an toggle somewhere to manually flip the default to cope with erroneous data.
  3. We need the authors of existing tools to fix them so they state the model and don't assume.

@cjw85
Copy link

cjw85 commented Jun 17, 2022

I would support a change to the specification that it recommends that SAM writers explicitely state . or ?.

@jts
Copy link
Contributor

jts commented Jun 17, 2022

I'm also in favour of recommending the flag is explicit. Maybe we should write a tool that performs sanity checks on mod bams to catch these issues. It could check for hard clips (#646), that all Cs (or whatever the canonical base is) are accounted for in MM, and simple summary stats for the number of modification/canonical/skipped calls for each context (CA, CC, CG, CT, etc)

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jun 22, 2022
The '.' code is the default interpretation, but historically tools
omitting '?' and '.' have used both styles.  An explicit definition in
the MM string removes any ambiguity.

See samtools#654 comments for background.
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Aug 15, 2022
The '.' code is the default interpretation, but historically tools
omitting '?' and '.' have used both styles.  An explicit definition in
the MM string removes any ambiguity.

See samtools#654 comments for background.

Co-authored-by: John Marshall <jmarshall@hey.com>
jmarshall pushed a commit that referenced this issue Aug 17, 2022
The '.' code is the default interpretation, but historically tools
omitting '?' and '.' have used both styles.  An explicit definition in
the MM string removes any ambiguity.

See #654 comments for background.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants