Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qualimap assigns more reads to exonic regions than total reads in the sample #1273

Open
ctuni opened this issue Mar 21, 2024 · 0 comments
Open
Milestone

Comments

@ctuni
Copy link
Contributor

ctuni commented Mar 21, 2024

Description of the bug

A while ago we realized that QUALIMAP assigns more reads to exonic+intronic+intergenic than total reads in the sample. We subsampled the paired-end SRA FASTQ to exactly 1M fragments by head -n 4000000 file_{1,2}.fastq > subsampled_{1,2}.fastq.
Then, the pipeline runs the samples through STAR, then Picard markduplicates, and then qualimap (broadly speaking).

The issue is that although the samples have exactly 1M reads, and the initial FastQC correctly reports that there are indeed exactly 1M reads, Qualimap consistently assigns more than 1M reads to genomic features, such as ~1.5M reads mapping to exons, and then a bit more to introns and intergenic regions. The STAR logs also report exactly 1M reads to map. The reads assigned to genomic features by qualimap, when added up, do not add up to 2M, so Qualimap is not quantifying reads instead of fragments, which could explain the discrepancy.

The way qualimap runs when the variables are translated to arguments is correct (our samples are paired-end and the strandedness is reverse):

qualimap \
        rnaseq \
        -bam input.markdup.sorted.bam \
        -gtf annotation.gtf \
        -p strand-specific-reverse \
        -pe \
        -outdir out/

We also made sure that the counting algorithm is uniquely-mapped-reads, as is shown in this screenshot from the Qualimap report. It is my understanding that using this counting algorithm means that chimeric and secondary alignments are not counted, which could explain the higher number if I did not use this counting algorithm:
image

We know that this is not a nf-core/rnaseq issue but we also reported the same issue a while ago on Qualimap's BitBucket repo. The link to the issue is here but it can't be accessed because the issue has not been approved yet. It was submitted on the 22nd of January 2024, so 8 weeks have passed and there are no responses or updates.

Maybe there is an explanation that we are missing but we have tried everything and still can't explain how Qualimap assigns more reads to genomic features than total reads in the sample. Personally, we have decided to drop the Qualimap results and compute those statistics with an in-house script, and wanted to open the discussion here.

Command used and terminal output

No response

Relevant files

No response

System information

No response

@ctuni ctuni added the bug Something isn't working label Mar 21, 2024
@drpatelh drpatelh added this to the 3.15.0 milestone May 13, 2024
@drpatelh drpatelh added documentation and removed bug Something isn't working labels May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants