You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to extract read fragments and their quality scores from a specific region of bam file using the following logic and set of commands:
#getting full-length reads that overlap the region of interest (e.g. from position 1346 to position 1361)
bedtools intersect -a pGEMref_ref_mapped_sorted.bam -b roi.1346.1361.pad.15.bed > 1346.1361.pad.15.overlap.reads.bam
#hard-clipping everything upstream and downstream of that region
samtools ampliconclip --hard-clip --both-ends -b not_roi.1346.1361.pad.15.bed 1346.1361.pad.15.overlap.reads.bam > 1346.1361.pad.15.region.only.bam
This does leave (roughly) 15 bases of interest within the bam file and I can see trimmed alignments in IGV.
However, when I extract fastq of trimmed reads using:
I see that fastq file contains the right read id and sequence, but it takes the quality scores from the start of the read rather than the trimmed region.
For example, here is a read from the original fastq file (used for generating pGEMref_ref_mapped_sorted.bam):
You do not even need to search through original read to spot that quality scores are taken straight from the start of the parent (full-length) read, whereas the sequence itself does not come from read start (which is correct and expected).
> loc <- stringr::str_locate("ATGTCTGGTACCCCTTCGACTACCGGTTGTGCAGCCATGCCGATATGGCAGAACGGGATGACAGAACCCTCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCGCTCATTAACCTCTAAAAATAGGCGTATCACGAGGCCCTTTCGTCTCGCGCGTTTCGGTGATGACGGTGAAAACCTCTGACACATGCAGCTCCCGGAGACGGTCACAGCTTGTCTGTAAGCGGATGCCGGGAGCAGACAAGCCCGTCAGGGCGCGTCAGCGGGTGTTGGCGGGTGTCGGGGCTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGAAATTGTAAGCGTTAATATTTTGTTAAAATTCGCGTTAAATTTTTGTTAAATCAGCTCATTTTTTAACCAATAGGCCGAAATCGGCAAAATCCCTTATAAATCAAAAGAATAGACCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGAGTCCACTATTAAAGAACGTGGACTCCAACGTCAAAGGGCGAAAAACCGTCTATCAGGGCGATGGCCCACTACGTGAACCATCACCCTAATCAAGTTTTTTGGGGTCGAGGTGCCGTAAAGCACTAAATCGGAACCCTAAAGGGAGCCCCGATTTACCAACTTTAACGGGGAAAGCCGGCGAACGTGGCGAGAGAAAGAGGAAGGGAAGAAAGCGAAAGGAGCGGGCGCTAGGGCGCTGGCAAGTGTAGCGGTCACGCTGCGCATAACCACCACACCCGCCGCGCTTAATGCGCCGCTACAGGGCGCGTCCATTCGCCATTCAGGCTGCGCAACTGTTGGGAAGGGCGATCGGTGCGGGCCTCTTCGCTATTACGCCAGCTGGCGAAAGGGGGATGTGCTGCAAGGCGATTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAGAGCTTGAGTATTCTATAGTGTCACCTAAATAGCTTGGCGTAATCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACGACATTGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGAGTGCCAGCTGCATTAATGAATGGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCACATTGCTCGCGTTCCCATGGCTCCGCCCCCCTGACGAGCATCACAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCTGGGAGAGTTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTATCTCGGTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAACTACGGCTACACTAGAAGAACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAGATCCTTTAAATCTTCTACGGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAAGGATCTTCACCTAGATCCTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGACCCACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGGTCCTGCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCCAGTTAATAGTTTGCGCAACGTTGTTGCCATTGCTACGGCATCGTGGTGTCACGCTCGTCGTTTGGTATGGCTTCATTCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGGTTAGCTCCTTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCAGCACTGCATAATTCTCTTCTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATCGTGTATGCGGCGACCCGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGCCACATGGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAGTTCGATGTAACCCACTCGTGCACCCGACTGATCTTCAGCATCTTTACTTTCACCAGCGTTTCTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAATAAGGGCGACACGGAAATGTTGAATACTCATACTCTTCCTTTTTCAATATTATTGAAGCATTTATCAGGGTTATCTTCTAGTCGGCGGATACATATTTGAATGTATTTTAGAAAAAATGAACAAATAGGGGTTCCGCGCACATTTACCGAAAAGTGCCACCTGACGTCTAAGAAACCGTT", "GTAGATAACTACGATAC")
> cat(loc[1], loc[2])
2375 2391
The text was updated successfully, but these errors were encountered:
Package versions (compiled on Ubuntu 22.04.4 LTS x86_64):
I am trying to extract read fragments and their quality scores from a specific region of bam file using the following logic and set of commands:
This does leave (roughly) 15 bases of interest within the bam file and I can see trimmed alignments in IGV.
However, when I extract fastq of trimmed reads using:
I see that fastq file contains the right read id and sequence, but it takes the quality scores from the start of the read rather than the trimmed region.
For example, here is a read from the original fastq file (used for generating
pGEMref_ref_mapped_sorted.bam
):And here is the same read ID from
1346.1361.pad.15.region.only.fastq
:You do not even need to search through original read to spot that quality scores are taken straight from the start of the parent (full-length) read, whereas the sequence itself does not come from read start (which is correct and expected).
The text was updated successfully, but these errors were encountered: