Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM tag and hard clipped alignments #646

Open
jts opened this issue May 18, 2022 · 30 comments · May be fixed by #714
Open

MM tag and hard clipped alignments #646

jts opened this issue May 18, 2022 · 30 comments · May be fixed by #714
Assignees

Comments

@jts
Copy link
Contributor

jts commented May 18, 2022

The MM tag describes the modification string relative to the sequencing read as generated by the instrument. For instance, consider this read, where only the 3rd C is modified:

read_a    ACCTCGCCA

If the read is aligned end-to-end, the CIGAR, SEQ and modification tag would be 9M, ACCTCGCCA and MM:Z:C+m,2. If the mapper has hard clipped the first two bases of the read however then the modification tag is the same but the CIGAR and SEQ will be 2H7M, CTCGCCA. A naive interpretation of the MM tag will indicate that the wrong base is modified (in this case, the 4th C from the original read).

This situation is not currently handled by the MM tag specification. I would suggest discouraging the use of hard clips (e.g. suggest the -Y flag is used for minimap2) and saying something like "MM tags are invalid when the alignment contains hard clips".

@jmarshall
Copy link
Member

The topic of hard clipping came up during the original PR discussion, in #418 (comment) and the following comments. I believe we concluded that using MM:Z:N+m,4 to represent this modification of the 3rd C would be immune to hard clipping by a non-MM-aware mapper (from #418 (comment), which I think is the last clipping related comment):

I think I'm satisfied that using the N-based notation works around [the hard-clipping problem]. So all we would need to do is add a sentence recommending that only N be used if the data is to be processed and potentially hard-clipped by a non-MM-aware tool.

However I don't think we ever added such a sentence.

That would be a more generous alternative to your suggested sentence disallowing MM+hard clipping entirely.

@jts
Copy link
Contributor Author

jts commented May 18, 2022

Thanks @jmarshall, I missed that when reviewing the original discussion before posting this issue. I agree using N works around this issue.

Is the mapper allowed to modify/fix MM tag when hard-clipping the reads? If so a downstream consumer of mod bams will need a way to determine whether the non-N MM tags for hard-clipped reads are reliable or not.

@jmarshall
Copy link
Member

jmarshall commented May 18, 2022

You're right that — either way — this needs more details in the spec.

The N notation enables non-MM-aware mappers to hard clip reads and leave MM as is, such that downstream tools that do understand MM can parse MM-with-Ns correctly (on the understanding that the N counts include the bases within 2H in their counting).

What MM-aware mappers should do is another question. For either exact-base-MM or MM-with-Ns, they would be capable of adjusting the counts, so that downstream tools could parse either correctly by considering the 2H bases not to be included in the counts.

These two are inconsistent… What to do?

  • Disallow the combination of hard clipping and MM
  • Specify that H bases are never included in the MM counts, and accept that non-MM-aware mappers will corrupt MM on reads that they hard-clip (currently the spec is not explicit whether hard-clipped bases are included in the counting)
  • Specify that for MM-with-Ns, H bases are counted, but for MM-with-exact-bases, H bases are not counted (the inconsistency is not attractive!)
  • Add another optional flag alongside . and ? that says “the counting starts after any hard-clipped bases”
  • Something else…

@jts
Copy link
Contributor Author

jts commented May 20, 2022

I've been mulling this over for a few days and leaning towards adding another flag. Without a flag the counts can be relative to the original (possibly unknown) sequence, and when the flag is set the counts are relative to SEQ instead.

@jkbonfield
Copy link
Contributor

jkbonfield commented May 23, 2022

This is one of those things which just becomes akward once hard clipping is used. We could state that if hard clipping is to be used then the N notation must be used, but it's punting the problem elsewhere and you'd need to set a flag in the methylation caller (which may not even exist) before hand, meaning the person doing the caller must know of the aligner options a downstream user may use. That may not be possible if your sequencing is split up into a data provider (eg a sequencing centre) and a data consumer (some research group). It's just messy!

I think it's best to recommend that hard-clipping isn't used. We can and should also be explicit with how to interpret the data when something wrong happens. So either something like this:

If an MM tag exists on a record with hard clips in the CIGAR field, the MM tag must be ignored.

Or if we wish to permit the possibility of working around with the N code:

Base numbering in the MM tag refers to the original sequence produced by the base caller. If the sequence has been hard-clipped using the H CIGAR code, then the MM tag is only valid if it uses the N fundamental base. In this case the size of the soft-clip must be used when computing the position along the SEQ field.

I prefer the former as it's simple, and the N trick feels a bit like a hack (similar to how someone would have to handle e.g. U2/E2, although AFAIK they're never used). Frankly hard-clips only tend to occur with secondary alignments, but I guess there may still be some desire to know base modification statuses for multiple alignments?

This does however mean there's no way for an MM aware aligner to work properly, which I assume is the reason behind @jmarshall 's comment above about an additional flag. If we add this, then we'd need a similar language to above to describe the flag-less state (ie it's invalid so ignore), as well as a description of the flag so we can document how to do things properly. So I could get behind this additional flag proposal too.

Care to suggest any recommended code? It could be something inline like ? and ., or it could be a separate general purpose aux tag to indicate the starting position for the MM tag. Eg for where only part of a sequence has been analysed, but also for where the sequence is part of a larger unrecorded fragment.

Edit: or (ghastly!) actually the only additional information we need when hard clipping is how many of each base type were clipped. I don't like where that's going though, and just starting so-many bases into the original unclipped fragment is a cleaner approach. Either way the presence of such data, as is the presence of an additional character flag, is what distinguishs the hard-clipped record with MM tag from being valid or invalid.

@PanZiwei
Copy link

PanZiwei commented Mar 3, 2023

Correct me if I am wrong:

ML tag can’t show the correct number of how many seq bases of the stated base type. In other words, ML tag DOESN'T know whether there are skipped base sites after the last modified base. So We need improvement for the ML tag

I will simplify the question with the assumption that the modified base is C.

So there should be a relationship between 1) The number of skipped C sites Nskip a) The sum of items saved in MM tag SUM(MM_tag)

Based on my observation, there is a potential issue here: ML tag can’t show whether there are skipped C sites after the last modified C.

For instance, if there is only 1 single C is modified and it is the 3rd C that is modified, I think the sequence ACCTCGCCA and ACCTCGA will have the same MM tag MM:Z:C+m?,2. In other words, MM:Z:C+m?,2 might have multiple sequence correspondence. (Correct me if I am wrong).

So currently the current relationship is: Nskip >= SUM(MM_tag)

If that’s the case, I think an extra number should be introduced as the last element of the MM tag to label the number of skipped C sites to distinguish the circumstances. In thise case, the ACCTCGCCA should be MM:Z:C+m?,2,2 (2 skipped C after the 3rd modified C), and ACCTCGA should be MM:Z:C+m?,2,0 (No C after the 3rd modified C). The corresponding ML will stay the same with only 1 element ML:B:C: 256

So the question will become: How to solve the current inconsistency due to the MM tag issue? Is there a way to use ML tag to show the correct number of skipped C in the prediction?

Thank you so much for your help!

@jkbonfield
Copy link
Contributor

You are incorrect in this interpretation.

The length of items in the ML tag should match the length of items in the MM tag, if both are present. Basically MM defines the bases while ML defines their confidence, just like traditional SEQ and QUAL.

Where things get a bit different is whether the MM tag is generated by checking all sites on the read (of a specific base type) or just a subset / specific sites.

If specific sites: we record in MM the bases we inspected, and use ML to indicate whether or not we believe they are modified.

All sites: we inspect all bases. We could do as per the specific-sites strategy above and record all these bases (essentially MM of 0,0,0,0...) and use ML to distinguish which are modified and which are not. However it's usually preferable to simple record in MM which of these we believe has a significant likelihood of being modified.

This distinction is where the implicit vs explicit notation came from. It wasn't initially in the specification as at the time of drafting it we weren't aware that people would only do base modification assessment in a subset of the read, believing instead it to be an inherent part of the base-calling and either always-on or always-off.

I'm sorry but I don't understand your second point. Perhaps it's related to orientation however. We use MM to mark a particular count of bases, say 3rd and 8th C. However this is from the 5' end, even when reverse complemented. We don't need to know how many additional bases there are between the last call and the end of the sequence. See the "specific" vs "all" above for why.

@PanZiwei
Copy link

PanZiwei commented Mar 3, 2023

Thanks for the reply. Let me refine the question: I want to know the coordination of called specific site in the original sequence. It seems that the called site is not necessarily the last site in the sequence.

I will still use specific sites C as an example. I indeed observed that the length of items in the MM tag matches the length of items in the ML tag when they are both present.

Meantime, MM/ML tag can both mark the number of called C (length of items in the ML tag) and the number of skipped C (The sum of items saved in the MM tag) before the last call. In other words, the length of items in the ML tag + the sum of items in the MM tag is equal to the number of the C from the 1st appeared C to the last called C in the original sequence, but there might be additional C between the last called C and the last C in the sequence.

If the sequence is XXCXXCXCXXCXCXXCXCXCXXCXCXXCXX (containing 11 C), and giving MM:Z:C+m?,2,4; ML:B:C,256,256. I can get 3 information (Correct me if I am wrong):

  1. From MM tag: 2 C are skipped, then 3rd C is called, then 4 C are skipped, then 8th C is called. So in total 2+4=6 C sites are skipped (the sum of items in the MM tag). The modified C location in the 3rd C and 8th C. And 8th C is the last called C ( 8=2+4=the length of items in the ML tag + the sum of items in the MM tag)
  2. From ML tag: the 3rd C and 8th C have 256/256=100% chance of being 5mC
  3. From ML tag and C counts in the original sequence: Since the original sequence has 11 C, there are 11-8=3 additional Cs are between the last call (8th C) and the end of the sequence.

@jkbonfield
Copy link
Contributor

Ah yes, that's true there may be additional Cs. The intention of the tag is not to act as a counter for the base types, but to identify specific bases as being methylated. There is nothing inconsistent here, just a different intention.

What is the reason for wanting to know the remaining C count? Is it to act as an error check? I can see, had we included the remainder somewhere, then we could spot disparities between MM and SEQ, eg due to hard clipping.

@PanZiwei
Copy link

PanZiwei commented Mar 3, 2023

What is the reason for wanting to know the remaining C count? Is it to act as an error check? I can see, had we included the remainder somewhere, then we could spot disparities between MM and SEQ, eg due to hard clipping.

I am trying to align the multiple alignments with the primary alignment for downstream analysis, and I utilized the MM tag as an error check as you mentioned, and clipping will also be involved. But previously I am not sure 1) whether the MM tag count from the 1st C 2) whether there are additional Cs after the last called C.

Now after your clarification, I think I know how to do it.

Thank you so much for your help!

@jkbonfield
Copy link
Contributor

jkbonfield commented Mar 7, 2023

Mulling over ideas on twitter with others, and in the htslib PR to reject MM on hardclipped reads (samtools/htslib#1567), I had the following thought process:

  1. Some tools correctly produce MM after having done hard clipping, so MM is counted correctly.
  2. Other tools write MM and then do hard clipping of SEQ, leaving MM invalid.
  3. There is no way to be able to tell these two apart, other than looking for position overflows. Rejecting the tag on reads with hard-clipping is penalising people who do things in the correct order or implement modification-aware hard-clipping tools.
  4. If we amend MM with an additional flag to distinguish between points 1 and 2 above, then we still run the risk of an additional tool modifying hard clips further, that doesn't update MM, so we're back to square 1 (point 2).
  5. This means we need an additional check, the most obvious being the length of the SEQ field at the time MM was produced.
  6. Putting an additioinal length into MM would break existing parsers, and changing the format isn't something we should be doing.

Which leads me to a final conclusion of:

  1. Add a new tag e.g. MZ:i:<seq_len> which is defined to be the length of the SEQ field at the time the MM tag was produced. Validaters can then just check MZ vs strlen(SEQ). If MZ isn't present and hard-clips are visible, then reject it, otherwise it's likely fine.

Thoughts?

@PanZiwei
Copy link

PanZiwei commented Mar 9, 2023

I think the new tag might be a good idea to check whether a read is hard-clipped or not, but it might also increase the storage. - Not sure if it will be a concern or not because the bam format is highly condensed, but I assume that introducing one more tag might not be a big issue.

@jkbonfield
Copy link
Contributor

A new tag is just a single number, so even uncompressed it's just 5-7 bytes per read, and less when compressed. I think that's largely an irrelevance for long-read data.

@jts
Copy link
Contributor Author

jts commented Mar 9, 2023

I think adding MZ is a reasonable solution. If we're going to consider adding new supporting tags I think it is worth running through some alternatives:

  • MS for Modification Seq - the SEQ that MM was generated against (could be * if matches SEQ modulo strandedness). This bloats the file and may cause issues when interpreting the alignment if SEQ is not a substring of MS. On the plus side it would be very simple to append MS into hard-clipped records to support parsing the modifications of such records without the aligner needing to be MM-aware.

or

  • MH for Modification Hardclips - provides the prefix/suffix hardclipped sequence needed to recover the original SEQ (say MH:z:ACGTA,CCA for a read with 5H...3H.

On the balance though MZ is probably the best way to solve the immediate problem while we think of a longer term solution.

@jkbonfield
Copy link
Contributor

Adding seq to MH is basically the same as switching hard-clips to soft-clips. There's really no point to doing the hard clipping if we're just going to store it somewhere else. Remember the reason for hard clipping is that sub-string secondary alignments on long-read data can be vast. It's not uncommon to see 10s of KB of hard-clipped data. Adding all that back again isn't really an option IMO. Or if it is, the option is to soft-clip not hard-clip.

We perhaps just record ACGT base counts for the trimmed data as that's all you really need to correct MM. However, if we have a M*-tag aware hard-clipping tool then it should just hard-clip MM/ML too and trim out the modifications that fall outside the clipped range, making MH/MS totally redundant.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Mar 29, 2023
If a sequence is hard-clipped after calling the base modifications,
then the tool may, or may not, update the MM and ML tags accordingly.
We have no way of distinguishing these two cases.  While the base
modification parsing code already detects overflows where the
coordinates go beyond the sequence end, this isn't fool proof,
especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding
the sequence length, to be written at the same time as the MM and ML
tags.  This can then be used as a sanity check later on, to detect
cases where the sequence has changed length via a tool that is unaware
of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases
off the start/end of MM/ML strings to make it trivial for tools that
are doing hard clipping via htslib.  (Indeed we don't even have an API
for SEQ/QUAL either, so it can do all together).  This would make it
far easier for people to keep everything in sync, and this code could
then also update MZ while it's at it.  That's new API though so it can
arrive as a separate commit.

See samtools/hts-specs#646
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Mar 29, 2023
This is used as a sanity check on the validity of the MM and ML tags.
It holds the length of SEQ at the time MM and ML were produced and/or
updated.  The intention is to provide a mechanism to detect
hard-clipping has been performed with a tool that is not MM/ML aware.

Fixes samtools#646
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Mar 29, 2023
If a sequence is hard-clipped after calling the base modifications,
then the tool may, or may not, update the MM and ML tags accordingly.
We have no way of distinguishing these two cases.  While the base
modification parsing code already detects overflows where the
coordinates go beyond the sequence end, this isn't fool proof,
especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding
the sequence length, to be written at the same time as the MM and ML
tags.  This can then be used as a sanity check later on, to detect
cases where the sequence has changed length via a tool that is unaware
of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases
off the start/end of MM/ML strings to make it trivial for tools that
are doing hard clipping via htslib.  (Indeed we don't even have an API
for SEQ/QUAL either, so it can do all together).  This would make it
far easier for people to keep everything in sync, and this code could
then also update MZ while it's at it.  That's new API though so it can
arrive as a separate commit.

See samtools/hts-specs#646
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Apr 6, 2023
This is used as a sanity check on the validity of the MM and ML tags.
It holds the length of SEQ at the time MM and ML were produced and/or
updated.  The intention is to provide a mechanism to detect
hard-clipping has been performed with a tool that is not MM/ML aware.

Fixes samtools#646
vasudeva8 pushed a commit to samtools/htslib that referenced this issue Apr 6, 2023
If a sequence is hard-clipped after calling the base modifications,
then the tool may, or may not, update the MM and ML tags accordingly.
We have no way of distinguishing these two cases.  While the base
modification parsing code already detects overflows where the
coordinates go beyond the sequence end, this isn't fool proof,
especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding
the sequence length, to be written at the same time as the MM and ML
tags.  This can then be used as a sanity check later on, to detect
cases where the sequence has changed length via a tool that is unaware
of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases
off the start/end of MM/ML strings to make it trivial for tools that
are doing hard clipping via htslib.  (Indeed we don't even have an API
for SEQ/QUAL either, so it can do all together).  This would make it
far easier for people to keep everything in sync, and this code could
then also update MZ while it's at it.  That's new API though so it can
arrive as a separate commit.

See samtools/hts-specs#646
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue May 2, 2023
This is used as a sanity check on the validity of the MM and ML tags.
It holds the length of SEQ at the time MM and ML were produced and/or
updated.  The intention is to provide a mechanism to detect
hard-clipping has been performed with a tool that is not MM/ML aware.

Fixes samtools#646
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this issue May 30, 2023
If a sequence is hard-clipped after calling the base modifications,
then the tool may, or may not, update the MM and ML tags accordingly.
We have no way of distinguishing these two cases.  While the base
modification parsing code already detects overflows where the
coordinates go beyond the sequence end, this isn't fool proof,
especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding
the sequence length, to be written at the same time as the MM and ML
tags.  This can then be used as a sanity check later on, to detect
cases where the sequence has changed length via a tool that is unaware
of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases
off the start/end of MM/ML strings to make it trivial for tools that
are doing hard clipping via htslib.  (Indeed we don't even have an API
for SEQ/QUAL either, so it can do all together).  This would make it
far easier for people to keep everything in sync, and this code could
then also update MZ while it's at it.  That's new API though so it can
arrive as a separate commit.

See samtools/hts-specs#646
@marcus1487
Copy link

Carrying conversation over from related IGV thread.

From the ONT side, modified base calls will always be made along side basecalling. So aligners would have to either pass through or treat MM/ML tags. I would not expect that we will end up in a place where aligners would be expected to make modified base calls. We have research commands to make modified base calls after canonical basecalling is completed, but this requires some extra tags at basecalling time as well as a lot of large and random file access and thus would not recommend this workflow in general. These commands are generally only recommended for research settings (e.g. applying several different or advanced research modified base models to the same set of basecalls).

Overall, I think the MN tag is a nice solution to validating MM/ML tags. If we can get good adoption across the board I think it makes sense for downstream tools (IGV etc) to require the MN tag. We can consider adding a command to modkit to add this tag to reads for backwards comparability of old files without this tag.

@lh3
Copy link
Member

lh3 commented Nov 17, 2023

I was directed from igvteam/igv#1435 as well. I will explain my preference first. I propose to add a new MC:A:Y tag. If this tag is absent, MM uses offsets on the original read sequence. If this tag is present, MM uses offsets on the SEQ field flipped to the read strand.

The root cause of the problem is that the spec doesn't say how to interpret MM in the presence hard clippings. Minimap2 thinks MM uses offsets on the original read, while IGV assumes MM uses offsets on the SEQ coordinate. To address the issue, the spec shall clearly specify what MM means when there are no additional tags. I see two possibilities:

  1. When there are no other tags, MM uses the original read coordinate.
  2. When there are no other tags, MM uses the SEQ coordinate.

We are discussing solution 2 only because IGV makes that assumption. I see no other benefit. It instead incurs several problems: a) Imposing 2 invalidates minimap2 which currently doesn't violate the spec. It turns makes existing minimap2, which conforms to the spec at present, non-conforming and thus breaks the backward compatibility of the spec. b) Due to the co-existence of new and old versions, we will see MM produced by both rules in coming years and we can't tell which rule is in action. This is bad for everyone. c) Existing tools that clip MM for minimap2 alignment assume rule 1. Imposing 2 will break them.

Imposing rule 1 will affect IGV and existing tools that clip MM. These are easy to fix. For IGV, subtract the hard clip length. For tools that clip MM, output MC:A:Y and call it done. Again, making minimap2 adopt rule 2 will force more tools to change in more complex ways.

The MN tag in #714 could work if it defaults to the original read length (this adopts rule 1). Nonetheless, MN brings a new complication: what if the original read length, SEQ length, and MN length are all different? We only have two scenarios here. My proposed MC:A:Y tag is simpler and less likely to cause inconsistent SAM.

@jmarshall
Copy link
Member

jmarshall commented Nov 17, 2023

MC has been Mate Cigar since 2014, so this alternative proposal would need to be altered to fit in with the spec's other standardised tags that already exist. Otherwise it is at best essentially equivalent to the MN existing proposal.

I agree that we should look again at #714's handling of the missing-MN case and revisit or at least expand on the recommended behaviour.

@jkbonfield
Copy link
Contributor

I'm sorry I still don't get it.

What use is knowing that MM applies to the original sequence when the recorded SEQ differs, other than the knowledge that MM Is now invalid.

I did think quite a lot on this before proposing MN, and realised that your MC idea doesn't work. Imagine this scenario.

  • We have a file with MC:A:Y (accepting that it'll be something different to MC for reasons John mentioned, but let's go with it for now). This was emitted by some tool that modified the sequence in some manner. Maybe some UMI tag removal tool, or similar.

  • We now pass that data through an aligner, which hard clips the sequence. It doesn't understand the M* tags, so it modifies seq but leaves MM, ML and MC intact.

The file is now fibbing. We claim MM refers to SEQ, but it clearly does not. The same would apply if the aligner knew about MM/ML, modified it, and set MC:A:Y, but then a later stage subsequently modifies SEQ again. The benefit of MN is that it encodes the one thing that really matters - has the coordinate system changed? Changing SEQ will lead to a mismatch in MN.

Also the idea of it being the aligned sequencing strand is fundamentally flawed again, for exactly the same reasons. It only works if all tools are updated to modify these tags. We did think though most of it before making the proposal. :)

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 17, 2023

The root cause of the problem is that the spec doesn't say how to interpret MM in the presence hard clippings. Minimap2 thinks MM uses offsets on the original read, while IGV assumes MM uses offsets on the SEQ coordinate. To address the issue, the spec shall clearly specify what MM means when there are no additional tags. I see two possibilities:

1. When there are no other tags, MM uses the original read coordinate.

2. When there are no other tags, MM uses the SEQ coordinate.

We are discussing solution 2 only because IGV makes that assumption. I see no other benefit. It instead incurs several problems: a) Imposing 2 invalidates minimap2 which currently doesn't violate the spec. It turns makes existing minimap2, which conforms to the spec at present, non-conforming and thus breaks the backward compatibility of the spec.

The spec does sort of hint at MM referring to the original sequence as-reported by the sequencing instrument because it categorically states that it is in the original sequence orientation and the "fundamental" base type reported for the modification is on that strand too. However I agree it could be clearer with respect to editing SEQ. Had we thought of that (beyond merely reverse complementing) then obviously we'd have thought about MN at the time too.

This isn't just a knee jerk reaction to appease IGV though. The system was designed around storing evidence and data, not around alignment viewers, and both option 1 and 2 were assumed to be the same thing simply because we didn't think about tools editing the sequence. (Thinking of the things missing from a spec is much harder than discussing the wording that is there.)

Sadly both 1 and 2 in your example are not working well, as some tools think it's 1 and some tools thing it's 2. As you say going forward this becomes bad for everyone unless there is a clear way to disambiguate these cases, hence the MN proposal. Please if you have improvements for wording or something is missing, do please comment on that PR.

Also, I'm hesitant to suggest hacks like if the MM positions refer to the original sequence (detected through some boolean tag) then we can just adjust them based on hard-clipping. It may work, but it's a blunt instrument. We don't know it is valid because we're just assuming that the only thing that modified SEQ post MM-calling was the aligner. It could be UMI processing, sequencing vector removal without hard clipping, etc. I don't want us to be in a position of making assumptions only to find them bust apart again in 6 months time. Yet another reason why a boolean "yes / no" tag is just an accident waiting to happen. We have a fighting chance of compensating for the original vs editing SEQ coordinates if we know both the original length and the new modified lengths. It's not perfect, but was the best I could come up with.

@lh3
Copy link
Member

lh3 commented Nov 17, 2023

it modifies seq but leaves MM, ML and MC intact

You only pass MM/ML to an aligner. MC encodes alignment information. You don't pass MC just like you don't pass NM.

It could be UMI processing, sequencing vector removal without hard clipping, etc.

SAM is designed to allow the retrieval of the original sequences such that they can be realigned. As long as SEQ and MM are intact on the primary record, these tools don't matter. If you can't reconstruct the original sequence and its base modifications, other solutions will be problematic as well.

Also the idea of it being the aligned sequencing strand is fundamentally flawed again

I don't have that idea. I think everyone agrees MM is relative to the read strand.

Please if you have improvements for wording or something is missing, do please comment on that PR.

We need to rewrite it. Let's replace MC with MH. In the text describing MM, add:

MM is always on the original read strand. If the MH tag is absent, offsets in MM are relative to the original read sequence. If MH is present, offsets in MM are relative to the sequence on the SEQ field on the original read strand.

Then describe MH:

MH:i – If this tag is present, MM describes the sequence on the SEQ field. If absent, it describes the original read sequence. MH only makes a difference if SEQ is shorter than the original sequence due to hard clipping. The value of MH is ignored.

I could send a PR with these several sentences but that would create another thread.

Also, I'm hesitant to suggest hacks like if the MM positions refer to the original sequence (detected through some boolean tag)

There is no perfection solution right now. I have explained the problems with forcing MM to describe SEQ. Those are worse. If we had followed the first rule, we wouldn't have had this thread and everything would have naturally worked together.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 17, 2023

Apologies for somehow hitting edit instead of quote reply and wrecking your comment Heng! Hopefully it represents what you originally wrote now. Try 2 on reply...

You only pass MM/ML to an aligner. MC encodes alignment information. You don't pass MC just like you don't pass NM.

Says who? Aligners get the whole kit and caboodle normally - all the tags they're meant to spit out into the SAM file. You seem to be punting the problem onto someone else by saying the aligner should get a subset of the tags only, and then presumably we have to lift-over the ones we didn't send to the aligner to add them back into the SAM file? I don't get it, sorry.

Or are you saying MC is something produced by the aligner and not something we have currently? If so, we have a problem with realignment potentially, as then we need software that knows how to convert from SAM back to FASTQ but dropping certain tag types. Although... hopefully that's only primary records and hopefully they have only been soft-clipped.

    It could be UMI processing, sequencing vector removal without hard clipping, etc.

SAM is designed to allow the retrieval of the original sequences such that they can be realigned. As long as SEQ and MM are > intact on the primary record, these tools don't matter. If you can't reconstruct the original sequence and its base
modifications, other solutions will be problematic as well.

I'm assuming people are wanting MM to work correctly on secondary alignments, which is the whole reason for this problem. Primary alignments hopefully are already fine and aligners (again hopefully) aren't hard clipping them.

However you still haven't thought about non-alignment tools. Something else that takes an MM/ML tag and the SEQ and does something (it doesn't matter what - anything will do) and isn't aware of MM/ML still needs a way of signally that the SEQ and MM are now out of sync. When that software isn't being modified, the ONLY way that they can be detected as out of sync is some additional tag (or embedded MM data, but that ship has sailed now) that permits detection. Hence MN. I don't think we can get around it with a boolean aligner-specific thing. We need the detection of edited SEQ to be evident without modifying any existing software.

We need to rewrite it. Let's replace MC with MH. In the text describing MM, add:

    MM is always on the original read strand. If the MH tag is absent, offsets in MM are relative to the original read sequence. If MH is present, offsets in MM are relative to the sequence on the SEQ field on the original read strand.

Then describe MH:

    MH:i – If this tag is present, MM describes the sequence on the SEQ field. If absent, it describes the original read sequence. MH only makes a difference if SEQ is shorter than the original sequence due to hard clipping. The value of MH is ignored.

I could send a PR with these several sentences but that would create another thread.

Sorry I strongly disagree.

Maybe I'm not putting my point over correctly, so let's tackle your suggestion and work through a scenario so the problem is evident. We start with a fastq containing SEQ, MM and ML.

"If the MH tag is absent, offsets in MM are relative to the original read sequence."
So that's the assumption I wish to demonstrate as false: that MH is the indicator that MM and SEQ are in sync.

So can we think of any process that could take a FASTQ, modify the SEQ/QUAL to spit out one or more reads of differing lengths, while faithfully keeping all the aux tags it's aware of unchanged?

I don't know of such tools, but I could imagine one - our own custom barcode splitting. (This is quite possibly how this sort of thing first happened, before instrument manufacturers started supporting these protocols themselves.)

We ligate a 6mer to the start of every DNA fragment as part of the library prep to indicate where it came from - maybe some pooling strategy. We then apply the standard instrument protocol, sequence it and get the data out. The instrument is totally unaware of our own barcode shenanigans, so we get a bunch of reads, all starting with our ligated barcode.

We wish to strip that barcode off and, let's say, split the data up into barcode specific files. Eg maybe we're inventing a pooling or multiplexing strategy to put several samples into a single lane. So we then have a tool which reads the first 6 bases of each seq and splits up the fastq into a set of new FASTQs, one per barcoded pool, with the first 6 bases of seq and qual removed. We may, or may not, copy those 6 bases + qual into their own tag, but that's an irrelevance here.

What we now have is an offset of 6bp for each seq, breaking MM coordinates.

This tool isn't aware of MM, so it doesn't know to set MH.

Do you see the problem now? Any strategy you think of that requires a flag to be set when the sequence is modified inherently requires us to edit existing software or to dictate that all future software is MM aware. We need the reverse - a strategy that can detect when SEQ has been modified but MM/ML have not even when software is totally unaware of the new aux tags.

(We also need a mechanism where new software that is aware of the new tags can indicate that they're still valid and have been updated - this is the bit of the problem that you are solving I think.)

    Also, I'm hesitant to suggest hacks like if the MM positions refer to the original sequence (detected through some boolean tag)

There is no perfection solution right now. I have explained the problems with forcing MM to describe SEQ. Those are worse. If we had followed the first rule, we wouldn't have had this thread and everything would have naturally worked together.

I'll repeat this again - we are not forcing MM to describe SEQ. Instead we are enabling a mechanism where we can detect they are not in sync so we don't make mistakes. I also don't get what you mean by "followed the first rule". (Don't mention fight club?)

Fundamentally we wouldn't want to start from here, but here we are (largely because of lack of engagement earlier on when MM was first proposed). The best way forward is simply a way that can

A) detect when there are problems and SEQ / MM are not in sync

and possibly B) describe by how much they are out of sync so we can check whether it matches known hypothesis as to the cause, such as hard-clipping amounts.

Your MH tag doesn't fufill either of these while the existing MN proposal does. I'm open to other ideas that do fulfil these requirements though.

@lh3
Copy link
Member

lh3 commented Nov 17, 2023

We are going around the circle. I understand what you said but most of them are not relevant.

detect when there are problems and SEQ / MM are not in sync

That is the fundamental difference between us. You assume MM and SEQ should be in sync, while I think MM and CIGAR should be in sync; as long as MM and CIGAR are synced, you can get correct methylation on SEQ.

Anyway, note that I have said "The MN in #714 could work if it defaults to the original read length". I can live with MN provided that the spec defines MM to be on the original sequence coordinate in the absence of MN/MH.

@lh3
Copy link
Member

lh3 commented Nov 17, 2023

Back from a meeting. If you want to go with MN, I would modify the spec to the following. In the text describing MM:

  • MM is always on the original read strand. If the MN tag is absent, offsets in MM are relative to the original read sequence. If MN is present and equals to the length of the sequence on the SEQ field, offsets in MM are relative to SEQ on the original read strand. If MN is present but different from the length of SEQ, MM is invalid.

Then describe MN:

  • MN:i – Length of sequence at the time MM/ML were produced. The value defaults to the length of the original input sequence, which can also be calculated from CIGAR.

Consequence of this proposal: minimap2 stays the same. IGV needs to be updated. Users no longer need to clip MM. If a pipeline prefers to clip MM to reduce storage, which will be minor, it can implement a step to clip MM and add MN. Samtools can provide one as well.

EDIT again: hmm... Looking at what I wrote above, I think the right solution is to just let MM defined on the original sequence without introducing new tags. This is not optimal storage-wise, but it is simple and clear. There will be confusion during the transition to this MM definition, but the confusion is unavoidable anyway.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 20, 2023 via email

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 20, 2023

Back from a meeting. If you want to go with MN, I would modify the spec to the following. In the text describing MM:

* MM is always on the original read strand. If the MN tag is absent, offsets in MM are relative to the original read sequence. If MN is present and equals to the length of the sequence on the SEQ field, offsets in MM are relative to SEQ on the original read strand. If MN is present but different from the length of SEQ, MM is invalid.

The first statement is already in there. Some of the other bits are too (eg explicitly stating that the absence of MN means MM is considered to be valid - which is basically that SEQ is the same as the original produce), but it doesn't harm to rephrase the same thing in multiple ways. It's good to be explicit about where MM is invalid too.

Then describe MN:

* MN:i – Length of sequence at the time MM/ML were produced. The value defaults to the length of the original input sequence, which can also be calculated from CIGAR.

I can live with that, although I'd say "at the time MM/ML were produced or updated" as we won't to promote the idea that MM/ML aren't immutable.

I'd also ditch the CIGAR comment. It's simply not helpful IMO as in most (but not all) cases we need the sequence too to process MM. If we wish to include such a statement, then it should be fully explained. Eg "When a fundamental base of N is in use, the MN length may be validated against the sequence length computed from CIGAR". That permits us to keep MM working, in a restricted sense.

Consequence of this proposal: minimap2 stays the same. IGV needs to be updated. Users no longer need to clip MM. If a pipeline prefers to clip MM to reduce storage, which will be minor, it can implement a step to clip MM and add MN. Samtools can provide one as well.

No one has ever claimed minimap2 needs to be updated. Whether or not you update it is entirely up to you, or to anyone else who happens to fork it.

This proposal is simply a way of distinguishing software that does hard clipping within updating MM/ML from software that does hard clipping and understands MM/ML and updates them accordingly.

For sorted data, such as emitted by minimap2, it should be quite easy to modify MM/ML/MN on hard-clipped data and to trim them, as we'll also have the unclipped data along side it. Thank you for the idea of adding a tool to do this in Samtools. It fits naturally in with the other fixes applied during samtools fixmate. I'm sure Picard has similar things too (eg CleanSam).

EDIT again: hmm... Looking at what I wrote above, I think the right solution is to just let MM defined on the original sequence without introducing new tags. This is not optimal storage-wise, but it is simple and clear. There will be confusion during the transition to this MM definition, but the confusion is unavoidable anyway.

Not introducing any new tag and stating that it simply matches the original sequence (essentially what is in the spec already) is just tantamount to saying "it's sometimes ok and sometimes broke; shrug!" as we offer no way for the user to work out when it is valid and when it isn't. MN Is a solution to this and we'll just have to agree to disagree here. I'm not going to drop it unless a better alternative is proposed.

@lh3
Copy link
Member

lh3 commented Nov 20, 2023

If MM uses a fundamental base of A, C, G or T then we need to know more than the sequence length. We need to know how many of that fundamental base was trimmed.

Ah, I see your point. I agree to add MN. It would be better to have a new tag that defines offsets on all bases, but the ship has sailed.

@jkbonfield
Copy link
Contributor

I did think about having that as a tag, but felt it probably wasn't justified. It's still better IMO if we can get a way to trim the MM tag back, especially when dealing with long sequences with potentially several short secondary matches. A post-processing tool while the data is still name collated would definitely be possible.

@cjw85
Copy link

cjw85 commented Nov 20, 2023

A post-processing tool while the data is still name collated would definitely be possible.

I suspect @ArtRand will provide that in nanoporetech/modkit, but also seems like a natural thing to be in samtools (like fixmate, calmd, and friends).

On the broader theme, I have not been able to come up with a more compelling solution than MN as recording the length of SEQ when MM was produced --- short of tearing it all down and starting from scratch.

@jts
Copy link
Contributor Author

jts commented Nov 21, 2023

I'm also in favour of the MN tag, and agree that a tool that does streaming correction of the MM tag for hardclipped reads prior to position sorting is a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Progressing
Development

Successfully merging a pull request may close this issue.

7 participants