samtools markup mark incompatible alignment(different CIGAR) as duplicates #1788

wulj2 · 2023-01-28T06:33:12Z

I am using samtools 1.16.1 Using htslib 1.16

and found the following bug in the markup algorithms

if some alignment records have same chr, start, end coordinates, although they might have different alignment details(such as indels), they can also be marked as duplicates, however, I think this might lead some loss of variant information in those alignments, for example, some patient with indel in some region and NGS sequences reads just mark this indel reads as duplicates as reference reads covering the same region, then no one will found this intel in the following pipeline.

r1	0	ref1	1	60	75M	*	0	0	ACACGTACTACATGACATGGGGGCATTAGCATTGACGATGGGGAAATTTATCATGGTAAACCCTAGCCGATCCAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:75	AS:i:75	XS:i:0
r2	1024	ref1	1	60	46M3D26M	*	0	0	ACACGTACTACATGACATGGGGGCATTAGCATTGACGATGGGGAAAATCATGGTAAACCCTAGCCGATCCAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF	NM:i:3	MD:Z:46^TTT26	AS:i:63	XS:i:0
r3	1024	ref1	1	60	39M4D32M	*	0	0	ACACGTACTACATGACATGGGGGCATTAGCATTGACGATAAATTTATCATGGTAAACCCTAGCCGATCCAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:4	MD:Z:39^GGGG32	AS:i:61	XS:i:0

the test data is here test.data.tar.gz
ref.fa is the reference
r1.fastq is the reads from ref.fa
test.bam is the alignment results of r1.fastq to ref.fa
mkdup.bam is the markup results generated by samtools markdup -s -f mkdup.stats test.bam mkdup.bam
mkdup.stats is the markup results stats file generated by the cmd before

I know samtools markup strongly rely on the coordinates of alignment, shall we take cigar compatibility check into consideration?
Will it be computation exhaustion if do so?

The text was updated successfully, but these errors were encountered:

whitwham · 2023-01-30T11:06:49Z

Yes, this is not so much a bug as a weakness in the algorithm (and once I've finished the documentation I am writing, a known weakness).

Though I am not sure how much of a weakness. Is this something you have come across with real data?

As well as slowing things down, matching CIGAR strings has its own problems. You can have reads that are duplicates but sequencing errors would make the CIGAR strings different and then you would miss the duplicates.

Still, it might be worth putting in as an optional check.

wulj2 · 2023-02-03T07:10:22Z

yeah, it is really difficulty to define a proper set of rules to check CIGARs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

samtools markup mark incompatible alignment(different CIGAR) as duplicates #1788

samtools markup mark incompatible alignment(different CIGAR) as duplicates #1788

wulj2 commented Jan 28, 2023

whitwham commented Jan 30, 2023

wulj2 commented Feb 3, 2023

samtools markup mark incompatible alignment(different CIGAR) as duplicates #1788

samtools markup mark incompatible alignment(different CIGAR) as duplicates #1788

Comments

wulj2 commented Jan 28, 2023

whitwham commented Jan 30, 2023

wulj2 commented Feb 3, 2023