Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methylation and hard clipping #1435

Closed
marcus1487 opened this issue Nov 15, 2023 · 41 comments
Closed

Methylation and hard clipping #1435

marcus1487 opened this issue Nov 15, 2023 · 41 comments
Assignees
Milestone

Comments

@marcus1487
Copy link

When methylation tags are supplied along with hard clipped CIGAR strings the methylation tags are rendered useless as they the counting of total canonical bases is no longer possible. It appears as though IGV may still be trying to parse these tags in this setting. Namely the default in minimap is to hard clip on supplementary alignments. There is an option in minimap to convert the hard clipping to soft clipping for supp alignments which would rescue the methylation tags. But for IGV visualization it would be ideal for the invalidated methylation tags to avoid any output at all (i.e. do not show any methylation for reads with hard clipping).

Related issue in Remora repo: nanoporetech/remora#131

@jrobinso
Copy link
Contributor

jrobinso commented Nov 15, 2023

@marcus1487 I'm not sure I am following you, why are methylation tags not valid with hard clipping? If a read is hard clipped the SEQ record should be clipped as well, and any methylation tags that reference the clipped sequence should be clipped also, otherwise I think you have an inconsistent bam file (inconsistent with the various specs).

@jrobinso
Copy link
Contributor

By methylation tags "clipped", I mean they should be consistent with the SEQ field.

@marcus1487
Copy link
Author

Everything you say is correct except that the methylation tags are generally not clipped at the alignment step (and therefore are not consistent with the SEQ field). The tags are simply "passed through" the aligner. The tags were designed to be anchored to the original SEQ field thus allowing aligners to simply pass the tag through alignment. This is consistent with the tag convention concerning reverse strand reads, where the methylation tags are not reversed even though the SEQ field is reverse complemented for reverse strand mappings. Thus these tags are not strictly against the spec, but are also not useful in isolation (without the primary mapping). Hopefully this clears up the issue a bit, but please let me know if there are specific clarifications that could help.

@jrobinso
Copy link
Contributor

jrobinso commented Nov 15, 2023

If the methylation tags are not consistent with the SEQ field I would consider that out-of-spec, from the spec doc

 This potentially differs to the sequence
stored in the main SAM SEQ field if the latter has been reverse complemented, in which case SAM FLAG
0x10 must be set

So read literally, the only case where they can differ is if the SEQ field is reverse complemented, in which case 0X10 must be set.

Regardless of spec interpretation the tags are misleading (and not useful as you say) if the SEQ field is a different sequence than the one they refer to. The best solution here I think is for the tool that produces these files to fix these tags so they are useful, i.e. they are in relation to the sequence in the SEQ field. There's not really much IGV can do here, we certainly do not want to disable these tags for all hard clipped alignments because of a specific tools default conventions.

I assume hard clipping is applied to save storage space? It has other consequences, showing, and sometimes blatting, soft-clipped sequences is one of the more popular IGV options for those studying structural variants. This information is lost if replaced with "H".

@jrobinso
Copy link
Contributor

I think we need to bring in a wider community here, from pacbio and others who might be producing these files, before doing something as drastic as disabling methylation display if an alignment is hard-clipped. Also some clarification of the SAM tag spec is in order. @ctsa any comments on this?

@marcus1487
Copy link
Author

marcus1487 commented Nov 15, 2023

I think the problem here is the aligners don't really carry support for tags in general. I raised an issue in minimap2 (lh3/minimap2#870) to support unaligned BAM/SAM input and this feature request was denied. I am guessing that this may have been at least partially due to issues such as this, but possibly @lh3 could comment more directly.

The methylation tags are enabled in minimap2 alignment using samtools fastq -T "*" to copy the uBAM tags into the FASTQ and then using the -y minimap option to copy all tags though to the output BAM file. A minimal request could be made to avoid carrying the MM and ML tags through on mappings with hard clipping. Or users can specify the -Y flag to use soft clipping in supp alignments. I general though I think enforcing that aligners parse and correct tags opens up potentially too much support burden.

@jrobinso
Copy link
Contributor

@marcus1487 Yes I agree on the support burden. But hacks for out-of-spec files opens up IGV support burdens. Hard clipping can be applied for other reasons, I assume, not just supplementary alignments. Is the proposed rule here that MM tags are never interpretable if a hard clip is used in the Cigar string?

@marcus1487
Copy link
Author

I think that a determination should certainly be made on this. We might get some traction on adding a clarification to the hts-spec on whether MM/ML tags can be allowed on reads with hard clipping CIGAR operations. There is certainly an application for this (saving the space of the extra SEQ and MM/ML tags), but the question is in practice where there actually be adoption of this? Any thoughts from @jkbonfield or @jmarshall? I'd be happy to open a related issue in hts-spec if we should move this conversation over there.

@jrobinso
Copy link
Contributor

jrobinso commented Nov 15, 2023

At a minimum some clarification on this phrase is in order, is the specific case (reverse complemented) comprehensive, or just an example of a case where the sequence referenced by the MM tag differs from the SEQ field?

 This potentially differs to the sequence
stored in the main SAM SEQ field if the latter has been reverse complemented, in which case SAM FLAG
0x10 must be set

If the presence of an "H" tag invalidates the MM/ML tags in every case then perhaps that should be noted somewhere in the spec.

@jmarshall
Copy link

jmarshall commented Nov 15, 2023

Our expectation was that, in the fullness of time, aligners and other tools doing hard-clipping would become methylation-aware and adjust MM/ML accordingly when they do their clipping. However there would be an interim period during which clipping tools would treat MM/ML just as any other unknown tag and hence not adjust it. Sadly, this interim period is ongoing.

In the original discussions, we concluded that N-based methylation notation would still be decodable after non-MM-aware hard-clipping so thought to recommend that if downstream clipping was expected then only N-based should be used. However no text to that effect was added to the spec (and it's bit of a non-answer anyway!), and for ATGC-based notation it's difficult to determine whether later clipping has occurred and invalidated MM/ML or if they are still valid.

The current discussion of the issue of hard-clipping is samtools/hts-specs#646. The response (see PR samtools/hts-specs#714) has been to add another field MN:i:<len> that records the original sequence length (and would be updated if a clipper updated MM/ML/MN) so that invalidation due to later clipping can be detected reliably (or for N-based notation, so that downstream tools know whether hard-clipped bases are measured in the MM counts).

The PR has general agreement and is likely to be merged essentially as is. Your thoughts on it as practitioners would be of interest.

@jrobinso
Copy link
Contributor

jrobinso commented Nov 15, 2023

@jmarshall @marcus1487 @lh3 @ctsa. As a practitioner I would just like a clear rule, ideally in conformance with the specs. The rule proposed here is to ignore MM tags in the presence of hard clipping. However, according to @jkbonfield (samtools/hts-specs#646 (comment))

Some tools correctly produce MM after having done hard clipping, so MM is counted correctly.

So some tools correct the tag when hard clipping, but some do not?

For IGV perhaps this should be a user preference (ignore MM tags in the presence of hard clipping).

@jrobinso
Copy link
Contributor

WRT the MN tag from the PR samtools/hts-specs#714, this seems like a simple and robust solution for validation. If I understand it correctly, if the MN value does not match SEQ length the MM tag should be considered invalid, i.e. ignored. I can implement that in anticipation of its acceptance.

@jmarshall
Copy link

jmarshall commented Nov 15, 2023

That would be the conservative approach. If N-based annotations have been used, you can do a little better.

When MN is present and its value does not match SEQ length:

  • ATGCU-based MM annotations should be considered invalid;
  • When the difference between MN and SEQ_length is the same as the length of the CIGAR H operation, N-based MM annotations can be interpreted as also counting the hard-clipped bases.

@jrobinso
Copy link
Contributor

@jmarshall Thanks. I don't think I've ever seen an N-based MM tag, when I do I will deal with it.

I still have the problem of current files, which is what this issue is about. Given that some tools DO correct the tag when hard clipping, I think the best I can do is introduce a user preference to ignore MM in the presence of H. @marcus1487 is this reasonable to you? I don't know what the default should be. My guess would be that most tools do not do this correction.

@marcus1487
Copy link
Author

Yes. This sounds reasonable to me. I agree that most tools currently do not correct the methylation tags. I would say that the better default would be to assume the tags are in corrected. I've raised the issue internally to add the MN tag to Nanopore reads. Then the default won't matter and tools fixing the tags can add the MN tag to validated hard clipped reads. This seems like a good solution given the current state of tools.

@lh3
Copy link

lh3 commented Nov 16, 2023

It is not aligners’ job to understand the meaning of optional tags. This would defeat the goal of optional tags. Requiring a tool to understand existing tags also breaks its existing backward compatibility and hurts its forward compatibility. If a new tag like MM is added in future, aligners today will not produce desired output. From a practical point view, people will use old aligners. This is inevitable. Having a mix of complete and truncated MM is worse.

@jkbonfield
Copy link

Fundamentally there is little that can be done to existing data when hard clipping. The file formats don't permit inplace editing of sequences with extra annotations (such as case sensitivity or new base codes), and most aligners would choke anyway. So that leaves an auxiliary channel for describing the modifications. Then, inherently, anything which modifies the sequences without modifying the auxilary channel is going to break the data.

We know there are some aligners which modify both sequence and aux data, and some which do not. That leaves users in a quandry - how do they know if the data is valid. We didn't think this through initially, but it lead to the subsequent addition of the MN tag so it is possible for tool chains to work out whether the MM data is in sync with the sequence or not. It's a very simple and low-cost check.

The flip side to this is perhaps we made a mistake with trying to optimise the data encoding size by making the modifications tied to a base type. It feels natural in principle, and it does shrink the data volume, but I definitely didn't think about the case of hard clipping. For that matter, nor did anyone else so I'm not beating myself up over that problem.

As @jmarshall points out above there is an avenue forward though. The specification permits annotations to be either base-type encoded (eg C+m) or any-base encoded (N+m).

@jkbonfield
Copy link

It is not aligners’ job to understand the meaning of optional tags. This would defeat the goal of optional tags. Requiring a tool to understand existing tags also breaks its existing backward compatibility and hurts its forward compatibility. If a new tag like MM is added in future, aligners today will not produce desired output. From a practical point view, people will use old aligners. This is inevitable. Having a mix of complete and truncated MM is worse.

I diagree. Aligners produce a variety of aux tags that adhere to the SAM specification. The spec and the aligners have been produced togethers over the years. It's not all the wild-west with everyone just making up their own formats and tags. Some tags are novel, but others have very clear well defined meanings. MM is one of those now, and some alignment authors have already started supporting it and "do the right thing".

The MN tag was added so people can distinguish between the aligners which corrected the data and the aligners which didn't. No one is forcing people to update their code, but obviously if people wish to align base modification data they will naturally start switching to tools which support base modifications.

@jkbonfield
Copy link

So read literally, the only case where they can differ is if the SEQ field is reverse complemented, in which case 0X10 must be set.

The base modification tags were quite deliberately written to describe the base mods on the original orientation of the sequence. This was done so that aligners could reverse complement the SEQ field without needing to understand the MM tag. I just didn't think of the hard-clipping issue at the time.

As an aside: nor did anyone else spot this problem, so I'm not beating myself up over that omission, and besides there's very little that could be done to combat it in a pleasant manner. FWIW we tried (for years!) to get community engagement when designing this tag. Almost no one wanted to be involved, preferring to sit back and wait for it to be completed and part of the spec before implementing it. Such problems could have been caught earlier if more eyes were willing to engage with spec updates.

My thought at the time was it was better to define a standard that people could follow before we get bifurcation from every vendor inventing their own way of storing the data (as ONT had already started on by adding Z to FASTQ files). With hindsight, although we could clearly see the need for this data to be stored in a canonical manner, the lack of desire from the community to engage with such standard development should probably have told me to simply stop and wait for the clamour to grow stronger out of necessity!

@lh3
Copy link

lh3 commented Nov 16, 2023

If we require to clip MM, the introduction of MM immediately invalidates all the old aligners that transfer tags. This requirement would break the backward compatibility of the spec. We shouldn't introduce a tag this way.

What is MN tag? I couldn't find its meaning in the spec or in the original github thread. Anyway, the right solution is to have a tag like MH:A:Y. If present, a consumer of MM assumes the MM tag clipped with hard clipping. If not present, the consumer should assume MM is not clipped.

@marcus1487
Copy link
Author

@lh3 the MN tag holds the length of the SEQ field when the MM tag was last updated. So for "tag pass through" aligners if the sequence is hard clipped this will not match the SEQ field length. But if an aligner does clip MM/ML tags then the MN tag would be updated along with these to confirm the validity for downstream tools. Draft tag spec if here: samtools/hts-specs#714

@lh3
Copy link

lh3 commented Nov 16, 2023

Thanks, @marcus1487. I was not aware of that thread. That could work if we flip its default behavior: when this tag is absent, downstream tools should assume MM corresponds to the full sequence regardless of hard clipping.

@jkbonfield
Copy link

jkbonfield commented Nov 16, 2023

I don't understand why you think that would solve it. If MN is absent we cannot just widly assume MM to be true, as most of the time when clipping is involved it won't be. (A few exceptions to that are the tools that correctly clip MM)

It needs to be this way around.

No MN tags means you cannot totally trust the data. (You could, maybe assume that if hard clipping was done then it's been dutifully recorded via cigar H op, but that's not guaranteed with some barcode removal tools, so it's not entirely safe to even assume the lack of hard-clipping implies the data is correct.)

Having MN tags means you can validate whether MM is in sync with SEQ. If it's not and the length difference is the sum of the hard-clips, and the MM tag was encoded with "N+op" instead of "base+op" then you can subtract the relevant end (5' or 3') of hard-clips to correct the data. Not clean, but doable. However for space reasons, I'm not aware of people using the N-notation in anger.

@ArtRand
Copy link

ArtRand commented Nov 16, 2023

+1 for the MN tag as a validation step.

I know we haven't started seeing the N+op tag, but once we do, wouldn't it make sense to require that it always carries the ? mode? Otherwise wouldn't N+m. and N+a. contain conflicting information? In the spirit of consuming only what you need, if the N+op? is seen, you can fully interpret the tag, if you see N+op. you'd need to gather up all the other tags to make sure that the implicitly canonical calls don't actually have modification probabilities associated with them.

@lh3
Copy link

lh3 commented Nov 16, 2023

most of the time when clipping is involved it won't be.

Why?

I think the right solution is: if MN is absent, MM is defined on the original read sequence regardless of clipping; if MN is present, MM is defined on the hard clipped sequence.

@jkbonfield
Copy link

jkbonfield commented Nov 16, 2023

most of the time when clipping is involved it won't be.

Why?

I think the right solution is: if MN is absent, MM is defined on the original read sequence regardless of clipping; if MN is present, MM is defined on the hard clipped sequence.

Sorry I'm still not getting this.

If clipping has happened, and we have both SEQ and MM without an MN tag, then we have zero evidence for whether the tool that modified SEQ also modified MM. It feels foolish to just naively assume it worked when the evidence so far is most tools won't.

As I mentioned above also, even if there is no evidence of clipping, it still doesn't automatically mean SEQ hasn't been tampered with, as several pipelines have code to manipulate SEQ without creating clips. Barcode extraction is one such component.

Edit: ah maybe it's a language thing. When you say "original read sequence", do you mean primary (non-secondary) alignments? For those, it MAY be OK for reasons mentioned above. The user would probably need to make that judgement call themselves though based on knowledge of the tools used in their pipeline. It's trivial for them to go through adding an MN tag to "bless" the MM if they deem it to be valid.

@lh3
Copy link

lh3 commented Nov 16, 2023

You are making the assumption that MM is defined for clipped SEQ only. That is not the case. How to write MM in the presence of hard clipping is undefined in the current spec. We just need to define it: without MN, MM shows the coordinates on the original read.

several pipelines have code to manipulate SEQ without creating clips

This should be discouraged as it will cause other problems, for example with the SA tag. EDIT: after another thought, this is actually irrelevant as long as MM is defined on the input sequence.

@lh3
Copy link

lh3 commented Nov 16, 2023

For the your "edit" – by "original read sequence", I mean the sequence without clipping. Maybe more precisely, the sequence in the input FASTQ.

@jmarshall
Copy link

This conversation is predicated on MM being one of the tag values copied over by the aligner from tags encoded ad-hoc on the FASTQ @ header line or similar. So the MM value would have been computed from the instrument's raw methylation data in some earlier stage by some bespoke program and emitted in the FASTQ.

I have been imagining that methylation-aware aligners would, in the fullness of time, generate the MM tag themselves, operating on the instrument's more raw methylation data passed to the aligner either encoded in or alongside the FASTQ. In this case, the aligner would clip its MM tags at the same time as it clipped its reads, and there is no issue.

@marcus1487 et al, in ONT's pipeline where is the MM value originally generated?

@lh3: If you think there is a specification issue here, please raise it as an issue at hts-specs. The conversation you have started here in the igv repo will be largely lost to posterity.

@lh3
Copy link

lh3 commented Nov 16, 2023

Even if an aligner supports SAM/BAM as input, it still needs to decide how to transfer MM to the output SAM. The problem is not specific to FASTQ.

As I explained above, if you demand the aligner to handle MM, you will invalidate all the old versions of aligners before MM was introduced. Additionally, due to the existence of old aligners, you will see both clipped MM and unclipped MM for years and you can't easily tell which rule is used in a BAM. This is worse than using unclipped MM only.

The current spec doesn't specify how to encode MM given hard clipping. This gives us the opportunity to define MM for the original read by default.

@jrobinso
Copy link
Contributor

If a user really wants to view base modifications for split alignments it would seem the best policy would be to use soft clipping. In any event thanks everyone for the discussion, I think further discussion should probably be moved to the spec repository, with a reference to this ticket.

@jmarshall
Copy link

jmarshall commented Nov 16, 2023

My comment was not a question of the format (SAM/BAM/FASTQ/etc) of the aligner input, but rather of which program actually constructs the MM value.

The specification maintainers and the designers of these tags are aware of the existence of old aligners. The concern about distinguishing clipped and unclipped MM is not news — please continue the specification-related conversation on samtools/hts-specs#646 instead of splitting the conversation.

@jkbonfield
Copy link

As I explained above, if you demand the aligner to handle MM, you will invalidate all the old versions of aligners before MM was introduced. Additionally, due to the existence of old aligners, you will see both clipped MM and unclipped MM for years and you can't easily tell which rule is used in a BAM. This is worse than using unclipped MM only.

Most of my reply is over on the relevant hts-specs page, which is where it should continue, but incase anyone comes here and gets the wrong end of the stick: I am not demanding anything. The specification is designed to cope with both aligners that do understand MM and those that do not, and how to differentiate between the two. We're looking for solutions basically and this rhetoric isn't helpful. If you think the proposed solution has a problem, then please comment on this PR.

@jrobinso jrobinso added this to the 3.0.0 milestone Nov 29, 2023
@jrobinso jrobinso self-assigned this Nov 29, 2023
@jrobinso
Copy link
Contributor

@marcus1487 I added the validation for MN tag (MN value == seq length) in the branch mn_tag_validation. I would like to be able to test it before merging, do you have a test dataset you can point me to? In fact, none of my test datasets seem to have any hard clipping.

In the absence of an MN tag I also added a validation comparing the # of implied bases from the MM tag vs the count of that base in the sequence itself. To my surprise this failed for 1/2 a dozen or so alignments in two of my test datasets. The sequence did not have enough "C" bases to cover the complete MM tag. I think I'm doing it right, example below.

C+m,7,1,4,9,4,2,15,2,1,0,4,14,3,5,1,2,1,2,7,2,3,12,0,2,5,3,2,1,0,1,10,1,0,9,1,6,5,0,0,4,3,6,0,1,1,1,6,0,0,4,1,2,7,4,1,3,1,2,9,0,1,4,6,11,5,3,10,1,15,0,6,1,0,2,0,2,10,8,1,32,7,2,3,5,11,2,1,0,0,3,2,2,2,3,0,2,1,4,0,0,2,2,3,2,2,2,2,1,2,0,2,4,1,0,2,0,0,1,6,2,5,2,11,3,1,3,2,2,2,0,2,1,6,51,3,20,2,4,1,1,2,7,2,2,3,3,1,1,7,1,1,3,2,11,7,1,4,5,0,1,3,6,2,1,3,3,2,17,4,5,2,0,5,1,5,1,1,1,5,1,7,4,6,0,3,2,17,0,0,3,2,4,0,1,4,6,3,3,14,0,4,0,2,2,1,6,3,7,5,3,9,17,9,2,3,1,0,3,4,3,0,9,3,1,16,7,10,12,9,2,2,4,1,0,9,2,1,3,0,5,5,0,7,1,4,3,7,11,23,1,4,0,17,20,0,3,1,1,4,16,3,7,1,0,9,0,10,7,5,4,5,18,1,3,7,1,0,13,11,1,0,4,3,1,3,2,30,6,0,1,5,0,14,0,15,0,2,1,9,8,0,4,5,2,24,12,3,3,2,3,7,11,39,8,1,21,16,2,0,5,2,25,12,4,11,0,2,5,0,0,11,1,10,0,2,1,6,2,4,5,0,3,7,7,0,0,7,4,0,0,3,9,3,6,11,1,17,5,1,2,2,1,0,3,0,1,2,1,5,2,3,12,8,0,4,0,1,1,5,4,9,5,4,1,7,15,2,2,3,0,0,3,4,4,1,0,4,5,2,1,2,2,34,0,3,1,0,0,16,3,8,6,4,3,1,0,5,8,2,5,7,1,11,3,10,0,13,1,3,0,2,1,17,5,16,6,1,3,1,6,12,14,3,23,1,1,5,7,4,1,1,0,0,0,26,24,13,36,4,5,12,2,2,2,18,0,0,15,3,17,5,0,4,2,10,1,1,1,1,3,2,50,20,5,5,1,13,2,1,2,9,8,0,0,1,0,4,8,4,1,0,2,5,1,1,3,2,11,25,2,6,1,0,11,2,1,10,4,5,17,10,0,11,18,0,1,17,6,18,4,5,14,2,0,0,3,2,2,5,0,6,23,9,4,10,26,0,4,20,7,5,1,4,3,3,2,0,3,0,12,1,6,1,9,5,2,5,3,19,1,0,9,8,6,2,0,0,0,2,1,5,4,2,4,6,2,0,2,5,5,0,18,1,3,6,2,0,9,1,7,0,3,0,0,4,2,1,0,2,1,0,0,2,1,2,3,0,1,2,3,0,1,3,1,0,1,0,0,0,0,2,1,0,1,2,0,0,4,10,1,2,1,2,0,5,0,2,1,2,3,1,0,4,1,2,1,4,1,0,1,0,3,4,2,3,1,0,7,3,2,0,1,1,1,1,0,7,3,0,0,1,0,1,4,0,1,0,1,1,0,2,1,0,9,2,3,6,5,9,2,1,5,1,0,3,1,1,5,0,6,2,1,3,0,1,2,0,1,1,2,1,8,2,2,0,4,1,0,2,0,4,0,6,3,0,0,5,1,0,2,4,0,0,0,0,0,0,3,1,12,1,1,4,0,4,2,1,0,3,1,0,1,4,2,1,6,2,2,2,1,8,1,3,2,2,6,0,1,1,4,7,8,1,3,0,1,1,2,2,14,1,17,0,1,9,24,0,1,3,2,9,5,19,3,10,18,6,0,28,1,4,3,8,3,28,12,2,13,4,4,11,6,0,2,3,0,2,1,4,4,6,3,0,0,2,17,1,0,0,0,3,2,2,66,9,18,6,12,3,16,7,15,5,5,4,2,5,18,61,10,0,1,3,0,2,3,2,0,2,0,5,2,0,0,1,3,0,1,5,2,1,2,0,3,0,1,1,8,0,1,0,1,0,45,0,16,14,2,2,4,11,4,7,2,5,2,3,0,2,0,2,5,1,1,0,0,6,0,1,0,1,3,2,2,8,4,2,4,2,4,4,5,0,0,6,13,3,7,0,7,2,0,6,4,0,0,4,3,0,0,1,0,2,4,2,0,1,3,4,4,5,10,6,3,0,0,1,2,3,1,5,3,0,3,1,24,12,0,4,7,8,0,8,0,3,2,24,30,15,5,12,19,25,58,3,6,4,27,4,2,0,3,5,22,16,8,11,9,9,17,1,4,13,1,27,1,2,1,2,0,0,3,0,10,13,2,4,17,1,17,3,11,0,3,2,9,1,9,2,0,0,1,1,19,5,5,1;
modified count = 1047
skipped count = 4781
mod + skipped = 5828
base count = 5708

@jkbonfield
Copy link

jkbonfield commented Nov 29, 2023 via email

@jrobinso
Copy link
Contributor

@jkbonfield Yes base count should be correct, I complement the base if strand is negative. I should add these are rather old test datasets, submitted along with bug reports and issue requests, so its not necessarily indicative of a widespread problem. However I think this test should catch most or at least many problematic alignments due to clipping in the absence of the MN tag. I hesitate to disable MM altogether for alignments with hard clipping since, apparently, some tools go to the trouble to correct the tags.

@jrobinso
Copy link
Contributor

jrobinso commented Dec 5, 2023

@marcus1487 I'd like to merge this PR which I think addresses this issue. As this thread has gotten a bit long I am repeating the description from above. Does it surprise you that some number of alignments fail the basic base count validation? This is true of every test dataset I have, which admittedly might not be representative. My hope is this basic validation catches clipping problems in the absence of an MN tag, but the alignments it is failing on do not have any recorded clipping.

I added the validation for MN tag (MN value == seq length) in the branch mn_tag_validation. I would like to be able to test it before merging, do you have a test dataset you can point me to? In fact, none of my test datasets seem to have any hard clipping.

In the absence of an MN tag I also added a validation comparing the # of implied bases from the MM tag vs the count of that base in the sequence itself. To my surprise this failed for 1/2 a dozen or so alignments in two of my test datasets. The sequence did not have enough "C" bases to cover the complete MM tag. I think I'm doing it right, example below.

In the absence of an MN tag I also added a validation comparing the # of implied bases from the MM tag vs the count of that base in the sequence itself. To my surprise this failed for 1/2 a dozen or so alignments in two of my test datasets. The sequence did not have enough "C" bases to cover the complete MM tag. I think I'm doing it right, example below.

C+m,7,1,4,9,4,2,15,2,1,0,4,14,3,5,1,2,1,2,7,2,3,12,0,2,5,3,2,1,0,1,10,1,0,9,1,6,5,0,0,4,3,6,0,1,1,1,6,0,0,4,1,2,7,4,1,3,1,2,9,0,1,4,6,11,5,3,10,1,15,0,6,1,0,2,0,2,10,8,1,32,7,2,3,5,11,2,1,0,0,3,2,2,2,3,0,2,1,4,0,0,2,2,3,2,2,2,2,1,2,0,2,4,1,0,2,0,0,1,6,2,5,2,11,3,1,3,2,2,2,0,2,1,6,51,3,20,2,4,1,1,2,7,2,2,3,3,1,1,7,1,1,3,2,11,7,1,4,5,0,1,3,6,2,1,3,3,2,17,4,5,2,0,5,1,5,1,1,1,5,1,7,4,6,0,3,2,17,0,0,3,2,4,0,1,4,6,3,3,14,0,4,0,2,2,1,6,3,7,5,3,9,17,9,2,3,1,0,3,4,3,0,9,3,1,16,7,10,12,9,2,2,4,1,0,9,2,1,3,0,5,5,0,7,1,4,3,7,11,23,1,4,0,17,20,0,3,1,1,4,16,3,7,1,0,9,0,10,7,5,4,5,18,1,3,7,1,0,13,11,1,0,4,3,1,3,2,30,6,0,1,5,0,14,0,15,0,2,1,9,8,0,4,5,2,24,12,3,3,2,3,7,11,39,8,1,21,16,2,0,5,2,25,12,4,11,0,2,5,0,0,11,1,10,0,2,1,6,2,4,5,0,3,7,7,0,0,7,4,0,0,3,9,3,6,11,1,17,5,1,2,2,1,0,3,0,1,2,1,5,2,3,12,8,0,4,0,1,1,5,4,9,5,4,1,7,15,2,2,3,0,0,3,4,4,1,0,4,5,2,1,2,2,34,0,3,1,0,0,16,3,8,6,4,3,1,0,5,8,2,5,7,1,11,3,10,0,13,1,3,0,2,1,17,5,16,6,1,3,1,6,12,14,3,23,1,1,5,7,4,1,1,0,0,0,26,24,13,36,4,5,12,2,2,2,18,0,0,15,3,17,5,0,4,2,10,1,1,1,1,3,2,50,20,5,5,1,13,2,1,2,9,8,0,0,1,0,4,8,4,1,0,2,5,1,1,3,2,11,25,2,6,1,0,11,2,1,10,4,5,17,10,0,11,18,0,1,17,6,18,4,5,14,2,0,0,3,2,2,5,0,6,23,9,4,10,26,0,4,20,7,5,1,4,3,3,2,0,3,0,12,1,6,1,9,5,2,5,3,19,1,0,9,8,6,2,0,0,0,2,1,5,4,2,4,6,2,0,2,5,5,0,18,1,3,6,2,0,9,1,7,0,3,0,0,4,2,1,0,2,1,0,0,2,1,2,3,0,1,2,3,0,1,3,1,0,1,0,0,0,0,2,1,0,1,2,0,0,4,10,1,2,1,2,0,5,0,2,1,2,3,1,0,4,1,2,1,4,1,0,1,0,3,4,2,3,1,0,7,3,2,0,1,1,1,1,0,7,3,0,0,1,0,1,4,0,1,0,1,1,0,2,1,0,9,2,3,6,5,9,2,1,5,1,0,3,1,1,5,0,6,2,1,3,0,1,2,0,1,1,2,1,8,2,2,0,4,1,0,2,0,4,0,6,3,0,0,5,1,0,2,4,0,0,0,0,0,0,3,1,12,1,1,4,0,4,2,1,0,3,1,0,1,4,2,1,6,2,2,2,1,8,1,3,2,2,6,0,1,1,4,7,8,1,3,0,1,1,2,2,14,1,17,0,1,9,24,0,1,3,2,9,5,19,3,10,18,6,0,28,1,4,3,8,3,28,12,2,13,4,4,11,6,0,2,3,0,2,1,4,4,6,3,0,0,2,17,1,0,0,0,3,2,2,66,9,18,6,12,3,16,7,15,5,5,4,2,5,18,61,10,0,1,3,0,2,3,2,0,2,0,5,2,0,0,1,3,0,1,5,2,1,2,0,3,0,1,1,8,0,1,0,1,0,45,0,16,14,2,2,4,11,4,7,2,5,2,3,0,2,0,2,5,1,1,0,0,6,0,1,0,1,3,2,2,8,4,2,4,2,4,4,5,0,0,6,13,3,7,0,7,2,0,6,4,0,0,4,3,0,0,1,0,2,4,2,0,1,3,4,4,5,10,6,3,0,0,1,2,3,1,5,3,0,3,1,24,12,0,4,7,8,0,8,0,3,2,24,30,15,5,12,19,25,58,3,6,4,27,4,2,0,3,5,22,16,8,11,9,9,17,1,4,13,1,27,1,2,1,2,0,0,3,0,10,13,2,4,17,1,17,3,11,0,3,2,9,1,9,2,0,0,1,1,19,5,5,1;
modified count = 1047
skipped count = 4781
mod + skipped = 5828
base count = 5708

@jrobinso
Copy link
Contributor

The verification step described above is included in release 2.17.0. This should catch hard-clipped sequences with uncorrected tags.

@jrobinso
Copy link
Contributor

@marcus1487 Well I might have to disable this validation by default because many files "in the wild" fail this basic validation count (expected bases vs actual bases in sequence).

@marcus1487
Copy link
Author

I would counter that this is a reason to leave the validation in place as opposed to removing it. If the tags are not valid according to these checks then it would risk visualizing incorrect calls and may lead to more harm than good. We are working on adding a modbase tag validation command to modkit (modkit tagstats) to help users diagnose these issues.

@jrobinso
Copy link
Contributor

jrobinso commented Feb 2, 2024

@marcus1487 Yes on a bit of reflection I agree, but only if you trust the validation. I'm thinking of annotating the alignment somehow to show that there are MM tags there but they aren't valid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants