Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samtools ampliconclip takes quality scores from read start rather than specifed region #2048

Closed
Duda5 opened this issue May 9, 2024 · 1 comment · Fixed by #2053
Closed
Assignees

Comments

@Duda5
Copy link

Duda5 commented May 9, 2024

Package versions (compiled on Ubuntu 22.04.4 LTS x86_64):

samtools 1.19.2-11-g15d6538
htslib 1.19.1-19-g5d2c3f72

I am trying to extract read fragments and their quality scores from a specific region of bam file using the following logic and set of commands:

#getting full-length reads that overlap the region of interest (e.g. from position 1346 to position 1361)
bedtools intersect -a pGEMref_ref_mapped_sorted.bam -b roi.1346.1361.pad.15.bed > 1346.1361.pad.15.overlap.reads.bam
#hard-clipping everything upstream and downstream of that region
samtools ampliconclip --hard-clip --both-ends -b not_roi.1346.1361.pad.15.bed 1346.1361.pad.15.overlap.reads.bam > 1346.1361.pad.15.region.only.bam

This does leave (roughly) 15 bases of interest within the bam file and I can see trimmed alignments in IGV.
However, when I extract fastq of trimmed reads using:

samtools fastq 1346.1361.pad.15.region.only.bam > 1346.1361.pad.15.region.only.bam.fastq

I see that fastq file contains the right read id and sequence, but it takes the quality scores from the start of the read rather than the trimmed region.

For example, here is a read from the original fastq file (used for generating pGEMref_ref_mapped_sorted.bam):

@39c3f914-f553-4b37-b624-5163de6db964 runid=5c8e94390734c4376414f1402ae47de96e25ef4e read=722 ch=365 start_time=2024-04-11T13:29:37.714647+02:00 flow_cell_id=FAY74868 protocol_group_id=20240411_X1_FAY74868_run0222 sample_id=no_sample barcode=barcode73 barcode_alias=barcode73 parent_read_id=39c3f914-f553-4b37-b624-5163de6db964 basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.3.0
ATGTCTGGTACCCCTTCGACTACCGGTTGTGCAGCCATGCCGATATGGCAGAACGGGATGACAGAACCCTCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCGCTCATTAACCTCTAAAAATAGGCGTATCACGAGGCCCTTTCGTCTCGCGCGTTTCGGTGATGACGGTGAAAACCTCTGACACATGCAGCTCCCGGAGACGGTCACAGCTTGTCTGTAAGCGGATGCCGGGAGCAGACAAGCCCGTCAGGGCGCGTCAGCGGGTGTTGGCGGGTGTCGGGGCTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGAAATTGTAAGCGTTAATATTTTGTTAAAATTCGCGTTAAATTTTTGTTAAATCAGCTCATTTTTTAACCAATAGGCCGAAATCGGCAAAATCCCTTATAAATCAAAAGAATAGACCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGAGTCCACTATTAAAGAACGTGGACTCCAACGTCAAAGGGCGAAAAACCGTCTATCAGGGCGATGGCCCACTACGTGAACCATCACCCTAATCAAGTTTTTTGGGGTCGAGGTGCCGTAAAGCACTAAATCGGAACCCTAAAGGGAGCCCCGATTTACCAACTTTAACGGGGAAAGCCGGCGAACGTGGCGAGAGAAAGAGGAAGGGAAGAAAGCGAAAGGAGCGGGCGCTAGGGCGCTGGCAAGTGTAGCGGTCACGCTGCGCATAACCACCACACCCGCCGCGCTTAATGCGCCGCTACAGGGCGCGTCCATTCGCCATTCAGGCTGCGCAACTGTTGGGAAGGGCGATCGGTGCGGGCCTCTTCGCTATTACGCCAGCTGGCGAAAGGGGGATGTGCTGCAAGGCGATTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAGAGCTTGAGTATTCTATAGTGTCACCTAAATAGCTTGGCGTAATCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACGACATTGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGAGTGCCAGCTGCATTAATGAATGGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCACATTGCTCGCGTTCCCATGGCTCCGCCCCCCTGACGAGCATCACAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCTGGGAGAGTTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTATCTCGGTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAACTACGGCTACACTAGAAGAACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAGATCCTTTAAATCTTCTACGGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAAGGATCTTCACCTAGATCCTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGACCCACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGGTCCTGCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCCAGTTAATAGTTTGCGCAACGTTGTTGCCATTGCTACGGCATCGTGGTGTCACGCTCGTCGTTTGGTATGGCTTCATTCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGGTTAGCTCCTTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCAGCACTGCATAATTCTCTTCTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATCGTGTATGCGGCGACCCGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGCCACATGGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAGTTCGATGTAACCCACTCGTGCACCCGACTGATCTTCAGCATCTTTACTTTCACCAGCGTTTCTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAATAAGGGCGACACGGAAATGTTGAATACTCATACTCTTCCTTTTTCAATATTATTGAAGCATTTATCAGGGTTATCTTCTAGTCGGCGGATACATATTTGAATGTATTTTAGAAAAAATGAACAAATAGGGGTTCCGCGCACATTTACCGAAAAGTGCCACCTGACGTCTAAGAAACCGTT
+
%%''$$$$$$%%%&&&$%)*-)&%###%$$#$##$$$#"#$&&'&(%$%%%%'%+)))8777789:9=<<32219B510,+,2189:DEFFMKEOEDN7660001:E75558..,+4**,,0088>.../<7HM=9::=DDE8'''(.95=?>=;32312@=;98546.-001<??999:;>;77<>>:8-**+EFEEIJIEGEDAC:8NGHHGGINEDMHIGFIJH@@GB>=<7AAACCFE<@@???*((&'&(()(),*--.51:;8::999566=@<<::<8877:9'&&&9FJEBEEHFJICA@EEFKFJSMGEFJMMIMNLJGNOMJIHHHSOPLSNJGIMMHGHSOSHSIHKIQHHFHJIGQMNSHFKHLSOGJHSSBJ;@6CCDKMQMISSSFLHJIMIJPNHJGFEB>>,,,+2239/-.=AADDHIHEPL<99;7111+---9+((*FBGH@@@A8887778;93331'''';667:?BABIAFEF@@@>?ASODCCGFHSB@>??64--.=M<<:<DIGDEDBCCEEHNGDGERGGEFJJIJGGIHNGDCGLOHJLSJSEEKINA6444@@CBCFIE6558GSLFAEEGHG@>=BFE@A@;<<<C?=:76,---FEHLSFBDCDKISSCCFEBBDGSJ?@AQOHGMIGEENHDGA>>?B==?<<AGB@AAFIIHG:::=>6:7444742233-(&&&%()2(,,-24>====98233FDGG?>=<>)(((,+2/*-*..444457699:973//;;BDB889FG@@@2//0O@@@?FFKSPFISEFCEJSIBD;:5**./((()99....///<<8AFFIE>>>=JSHSGFC854,+*))*/44////5776<@8777>DHGKILSMIKNIBSG<++,565/30*((('%&&&%%%35<111/114;=:67446:<B?111/004;9=GGFB99710'&&'*/;--,-HEGEIKEHIKJHAA717AESLIS?BCAIJISJKIKLSRKQHHMNSMOGSSMSQJFCCBBNKSNLMMSGGGKSSSLHS>;;DLHJRFHDAAAGKG>>>@KF?444GIHI>@2100368:<==@JOJJSOIJKJFC3...3-))*.01-)*2;999:LGFNKHLSGIHH???<AFPKKGHJIMKOSSLMSSIKSESLDGAADCDGSKKSPISSJJKRLJFABA@AEEHGCBAA&&&%.(%%'.3/333?;<=EILIEGGMSIJHLNEFGFJDGEBBBGEDCCBDDBIFEGSGGSSSPINIMQOGLFCJCCA/...,-13558EFA:4585556ID?888+($$$(20045@GCDJGGFCEBD3224B>==>FA@334DJDEDKK4334?0-,,-./15?S>===C>787791,,<0../70(<<<>D>=>>CB::;=455@=<<85458D;446BBAAACCBJDCAC=55<@;;;;EDBEDE)((((((CGSLHMEJIIJIHA6435SRSJGIGC@<;<@FGSH@ASKI>==>ADINJDEGIOSIGRMSIGF?998::C:SBBNMGB?=5((&&&$**)))((%'-.+/,-.7666:>>ESEHECHNEGGDCB>?>?SIGB==<<@@6556>EEOGSIFMSILEDDBCGA@556?<888:AACECA::95=B@=>=<77<:433-,+)(&&%&)'128:@DCFAE=>:CBA;<3///5;1///<<;8<EDFGHDDCBBF=:4775<5.ABA@A@7889HMEFIFHEKJLSGD54225A@<A9:H@@@?I54432/,++''&&()(*4223;;99788=>>IEEDDRISGKSB@?=740/****069@@>:9;;<96;43334448<;?@>A>ADEFGHKNKIPSKFSFJG:;::<?<A@AE?@@@OFLA>?<<:=:6IHEFJGFFGSSLLSJLMJIGGJJOLSN;:88785673,,010436;<><=2225?DDCSSKFHDDEED?LGDGJSPKSMIJGMSEEDGHC;;:<BA843&&),+/;35>?AA@@BCRMJJHKMHEGEDEFIGDBBDMHGGIIB@91,.==@CEEIA=<:6>BGSHHMGF8;<<IGSIFHFHACB733479>?<9;@@A?A=,,,,1)2=@64462111@A778:<;;<?>9999;6326777996556543644**3243461222--+(()*349CE=<=BE@??/...DMNKS=>=?J;:89=HGD@=IIGCCD835:=433(((23422102578;;<<>=<=3+**'&%%&'--7SF??BINHJJMN>ISLSDSS9778:@@?BGINJQSLHHHGFCGFHGDHCFDBFHGLPKPGGD?AAFEFHJLISLKH4333GSNKD=>>@LIPG?>>>4333BBBIFLGGNHFHHHOILKGH>?=>HHOSNMHFDEC<885667J=<86444778;;==B9<DPSEDABJGIHBSSIEELJGMKLFHEGHGOSGFKFEFHGHSQGIJHKGIFFCDDSJFB@@EFFSMLSLDD9989>SFPHSLDDBGADEJHFOSQKILNS==<;<?BBBB@;<=>?<<<@22;93102677=>?HGAABDDA@>???D<<<?AA@=>>>BBB@???.*)*)))),)''*,+++8>?D@A@???<RHEEMHJGFGOKHQEEEBBBAMS@>000BKDEJIHIFSIGIIEIGGE=<;;AGEEBA&&&&/),479::;=:;5@CA@CEAB@111110232.)&%&'(,//044=FGHHGHNOJHIINHDJSSSKEHFGBA@<:420)+BIHRNHIFFDD===<>7666<9;9745:9>>>?@;:98??9:BECBCIIMHKJJFHNFFFG<<;:>>@LIFQIIHKD111DDIFHKJHGHKDJ22240/065<A???:::;IIHPOJOSIDED??>=?@BCA9::8:9:@@>8)'''DDCEDACB8889:777<=A@?@=@EI?@>IFHGGEGSEGHOIQSHHGECD>>70//0112(''+/3584>AJEGC@ABACAIGCCDDSHSHDSKKLJIISLSICB?CB;?-+++03<AFDDG=GIEFSISJFB@FFGEEEEHD111:C<<<EEDIEA@@ADEGKB((((?@LFGH<AHJEE<;=E.+++81332466FJHFGLEHFSHSEH:;;HGSANLJKSSHOSNLH@;A7@;<=MSHJ;KJDA?=>?GI>==>G==>?KSSFD.--/FFDFFO;999IJE<@<SSLBFEFP@A@@HFGGBLDIDHMM77677;>66000./)'')))).449<=?@@BC@=LIJFJ;:::;(%%%*<=<8(''&.-..*)))@?>5?455678:::9///,57:DGGDACBBGE?>>;<;;83///1.-('%

And here is the same read ID from 1346.1361.pad.15.region.only.fastq:

@39c3f914-f553-4b37-b624-5163de6db964
GTAGATAACTACGATAC
+
%%''$$$$$$%%%&&&$

You do not even need to search through original read to spot that quality scores are taken straight from the start of the parent (full-length) read, whereas the sequence itself does not come from read start (which is correct and expected).

> loc <- stringr::str_locate("ATGTCTGGTACCCCTTCGACTACCGGTTGTGCAGCCATGCCGATATGGCAGAACGGGATGACAGAACCCTCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCGCTCATTAACCTCTAAAAATAGGCGTATCACGAGGCCCTTTCGTCTCGCGCGTTTCGGTGATGACGGTGAAAACCTCTGACACATGCAGCTCCCGGAGACGGTCACAGCTTGTCTGTAAGCGGATGCCGGGAGCAGACAAGCCCGTCAGGGCGCGTCAGCGGGTGTTGGCGGGTGTCGGGGCTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGAAATTGTAAGCGTTAATATTTTGTTAAAATTCGCGTTAAATTTTTGTTAAATCAGCTCATTTTTTAACCAATAGGCCGAAATCGGCAAAATCCCTTATAAATCAAAAGAATAGACCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGAGTCCACTATTAAAGAACGTGGACTCCAACGTCAAAGGGCGAAAAACCGTCTATCAGGGCGATGGCCCACTACGTGAACCATCACCCTAATCAAGTTTTTTGGGGTCGAGGTGCCGTAAAGCACTAAATCGGAACCCTAAAGGGAGCCCCGATTTACCAACTTTAACGGGGAAAGCCGGCGAACGTGGCGAGAGAAAGAGGAAGGGAAGAAAGCGAAAGGAGCGGGCGCTAGGGCGCTGGCAAGTGTAGCGGTCACGCTGCGCATAACCACCACACCCGCCGCGCTTAATGCGCCGCTACAGGGCGCGTCCATTCGCCATTCAGGCTGCGCAACTGTTGGGAAGGGCGATCGGTGCGGGCCTCTTCGCTATTACGCCAGCTGGCGAAAGGGGGATGTGCTGCAAGGCGATTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAGAGCTTGAGTATTCTATAGTGTCACCTAAATAGCTTGGCGTAATCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACGACATTGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGAGTGCCAGCTGCATTAATGAATGGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCACATTGCTCGCGTTCCCATGGCTCCGCCCCCCTGACGAGCATCACAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCTGGGAGAGTTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTATCTCGGTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAACTACGGCTACACTAGAAGAACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAGATCCTTTAAATCTTCTACGGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAAGGATCTTCACCTAGATCCTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGACCCACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGGTCCTGCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCCAGTTAATAGTTTGCGCAACGTTGTTGCCATTGCTACGGCATCGTGGTGTCACGCTCGTCGTTTGGTATGGCTTCATTCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGGTTAGCTCCTTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCAGCACTGCATAATTCTCTTCTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATCGTGTATGCGGCGACCCGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGCCACATGGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAGTTCGATGTAACCCACTCGTGCACCCGACTGATCTTCAGCATCTTTACTTTCACCAGCGTTTCTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAATAAGGGCGACACGGAAATGTTGAATACTCATACTCTTCCTTTTTCAATATTATTGAAGCATTTATCAGGGTTATCTTCTAGTCGGCGGATACATATTTGAATGTATTTTAGAAAAAATGAACAAATAGGGGTTCCGCGCACATTTACCGAAAAGTGCCACCTGACGTCTAAGAAACCGTT", "GTAGATAACTACGATAC")
> cat(loc[1], loc[2])
2375 2391
@whitwham whitwham self-assigned this May 9, 2024
@whitwham
Copy link
Contributor

whitwham commented May 9, 2024

On a quick check I believe you are right. Both the left and right quality trimming take it off the right hand side.

Good find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants