Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdup on paired reads using --barcode-tag malfunctions if tag isn't present in both reads #1978

Open
eboyden opened this issue Jan 29, 2024 · 0 comments
Assignees

Comments

@eboyden
Copy link

eboyden commented Jan 29, 2024

Copied from #1710

Running samtools 1.19.2 on Linux x86_64

When determining duplicate status or selecting the best representative pair among a pool of duplicates, read pairs should always be considered together. Currently that is not always the case.

I am trying to pipe the output from umi-tools group to samtools markdup, taking advantage of the error correction of the former with the proper bitflag updating of the latter. However umi-tools only tags read 1 in a pair with the consensus UMI (in the BX tag). This causes samtools to apparently properly mark duplicates for read 1, but seeing no consensus UMI barcode tag in read 2, it marks duplicates based purely on position. Which produces data like the following:

Command:
$umitools group -I /dev/stdin --paired --output-bam --compresslevel 0 --extract-umi-method tag --umi-tag RX --umi-tag-delimiter=- --unmapped-reads output | $samtools markdup -@ 8 -m s -d 100 --duplicate-count --barcode-tag BX --write-index - $o.bam##idx##$o.bai

MN01688:12:000H3MG5N:1:11110:18319:15382        1187    chr1    999212  42      102M    =       999212  102     CTCGGGCGCTGGGGCCTCTGCGGGGGCAGCCGGGGACAGCGCGGCCGGGCGGCGGGAGGGTCCCAGCTGGCGCAGGCAGGCTGCCAGGTGGCCCAGCAGGCG
  /FFFAFFFFFFFFFF=/A/AAFFFFFF/AFFFF/FFAFF/F/FFFFFFAFFFFFFF/FFF/FFA/FF=F=FAF/FFF/FFFF/F//FFF/FFFF//A/FFFF  AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:41A60      YS:i:-11        YT:Z:CP RG:Z:A    RX:Z:ACTCG-GGGCA        MQ:i:42 MC:Z:102M       ms:i:3447       dt:Z:LB
MN01688:12:000H3MG5N:1:13102:17035:16176        163     chr1    999212  42      102M    =       999212  102     CTCGGGCGCTGGGGCCTCTGCGGGGGCAGCCGGGGACAGCGAGGCCGGGCGGCGGGAGGGTCCCAGCTGGCGCAGGCAGGCTGCCAGGTGGCCCAGCAGGCG
  =FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF6FFFAF/FFFFFFFAFFFFFFFFFFFFFA=FF=FFFFFFAFFFFFFFFFFFFFFFFFF//=FFFFFF  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:102        YS:i:0  YT:Z:CP RG:Z:A    RX:Z:GCACT-CGATG        MQ:i:42 MC:Z:102M       ms:i:3508       dc:i:2
MN01688:12:000H3MG5N:1:11110:18319:15382        83      chr1    999212  42      102M    =       999212  -102    CTCGGGCGCGGGGGCCGCTGCGGGGGCAGCCGGGGACAGCGAGGCCGGGCGGCGGGAGGGGCCCAGCTGGCGCAGGCAGGCTGCCAGGTGGCCCAGCAGGCG
  F//FFFFA/FFAFFFF/FAFFFF=FFFFFFFFFFFFFFFFF/FFFFFAAFAFAFFFFFFA/FFFFFAAFFFFFAFFAFFFFAF6FFFFAAFFFFFFFFFAFF  AS:i:-11        XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:9T6T43T41  YS:i:-3 YT:Z:CP RG:Z:A    RX:Z:ACTCG-GGGCA        MQ:i:42 MC:Z:102M       ms:i:2994       UG:i:42 BX:Z:ACTCGGGGCA dc:i:1
MN01688:12:000H3MG5N:1:13102:17035:16176        83      chr1    999212  42      102M    =       999212  -102    CTCGGGCGCTGGGGCCTCTGCGGGGGCAGCCGGGGACAGCGAGGCCGGGCGGCGGGAGGGTCCCAGCTGGCGCAGGCAGGCTGCCAGGTGGCCCAGCAGGCG
  //FFAAAFA/FFFFF/A/FFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFAFFFFAFFFFFFFFFFFF=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:102        YS:i:0  YT:Z:CP RG:Z:A    RX:Z:GCACT-CGATG        MQ:i:42 MC:Z:102M       ms:i:3591       UG:i:43 BX:Z:GCACTCGATG dc:i:1

Both read 1s are marked unique despite the identical positions, based on different values in the BX tag. But both read 2s lack the BX tag, so samtools marks one as unique and the other as a duplicate. Hence the second read pair is discordant, with read 1 marked as unique and read 2 marked as a duplicate.

It would be best if samtools handled this elegantly - i.e. if the reads have different tag values or one is missing the tag, samtools uses whichever is present or defaults to read 1 if both are present but differ - but barring this, it might be worth making explicit in the documentation that if a barcode tag is specified, both reads in a pair must have it (and have the same value) or else erratic behavior may result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants