Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samtools markup mark incompatible alignment(different CIGAR) as duplicates #1788

Open
wulj2 opened this issue Jan 28, 2023 · 2 comments
Open

Comments

@wulj2
Copy link
Contributor

wulj2 commented Jan 28, 2023

I am using samtools 1.16.1 Using htslib 1.16

and found the following bug in the markup algorithms

if some alignment records have same chr, start, end coordinates, although they might have different alignment details(such as indels), they can also be marked as duplicates, however, I think this might lead some loss of variant information in those alignments, for example, some patient with indel in some region and NGS sequences reads just mark this indel reads as duplicates as reference reads covering the same region, then no one will found this intel in the following pipeline.

r1	0	ref1	1	60	75M	*	0	0	ACACGTACTACATGACATGGGGGCATTAGCATTGACGATGGGGAAATTTATCATGGTAAACCCTAGCCGATCCAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:75	AS:i:75	XS:i:0
r2	1024	ref1	1	60	46M3D26M	*	0	0	ACACGTACTACATGACATGGGGGCATTAGCATTGACGATGGGGAAAATCATGGTAAACCCTAGCCGATCCAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF	NM:i:3	MD:Z:46^TTT26	AS:i:63	XS:i:0
r3	1024	ref1	1	60	39M4D32M	*	0	0	ACACGTACTACATGACATGGGGGCATTAGCATTGACGATAAATTTATCATGGTAAACCCTAGCCGATCCAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:4	MD:Z:39^GGGG32	AS:i:61	XS:i:0

the test data is here test.data.tar.gz
ref.fa is the reference
r1.fastq is the reads from ref.fa
test.bam is the alignment results of r1.fastq to ref.fa
mkdup.bam is the markup results generated by samtools markdup -s -f mkdup.stats test.bam mkdup.bam
mkdup.stats is the markup results stats file generated by the cmd before

I know samtools markup strongly rely on the coordinates of alignment, shall we take cigar compatibility check into consideration?
Will it be computation exhaustion if do so?

@whitwham
Copy link
Contributor

Yes, this is not so much a bug as a weakness in the algorithm (and once I've finished the documentation I am writing, a known weakness).

Though I am not sure how much of a weakness. Is this something you have come across with real data?

As well as slowing things down, matching CIGAR strings has its own problems. You can have reads that are duplicates but sequencing errors would make the CIGAR strings different and then you would miss the duplicates.

Still, it might be worth putting in as an optional check.

@wulj2
Copy link
Contributor Author

wulj2 commented Feb 3, 2023

yeah, it is really difficulty to define a proper set of rules to check CIGARs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants