Qualimap assigns more reads to exonic regions than total reads in the sample #1273

ctuni · 2024-03-21T10:44:16Z

Description of the bug

A while ago we realized that QUALIMAP assigns more reads to exonic+intronic+intergenic than total reads in the sample. We subsampled the paired-end SRA FASTQ to exactly 1M fragments by head -n 4000000 file_{1,2}.fastq > subsampled_{1,2}.fastq.
Then, the pipeline runs the samples through STAR, then Picard markduplicates, and then qualimap (broadly speaking).

The issue is that although the samples have exactly 1M reads, and the initial FastQC correctly reports that there are indeed exactly 1M reads, Qualimap consistently assigns more than 1M reads to genomic features, such as ~1.5M reads mapping to exons, and then a bit more to introns and intergenic regions. The STAR logs also report exactly 1M reads to map. The reads assigned to genomic features by qualimap, when added up, do not add up to 2M, so Qualimap is not quantifying reads instead of fragments, which could explain the discrepancy.

The way qualimap runs when the variables are translated to arguments is correct (our samples are paired-end and the strandedness is reverse):

qualimap \
        rnaseq \
        -bam input.markdup.sorted.bam \
        -gtf annotation.gtf \
        -p strand-specific-reverse \
        -pe \
        -outdir out/

We also made sure that the counting algorithm is uniquely-mapped-reads, as is shown in this screenshot from the Qualimap report. It is my understanding that using this counting algorithm means that chimeric and secondary alignments are not counted, which could explain the higher number if I did not use this counting algorithm:

We know that this is not a nf-core/rnaseq issue but we also reported the same issue a while ago on Qualimap's BitBucket repo. The link to the issue is here but it can't be accessed because the issue has not been approved yet. It was submitted on the 22nd of January 2024, so 8 weeks have passed and there are no responses or updates.

Maybe there is an explanation that we are missing but we have tried everything and still can't explain how Qualimap assigns more reads to genomic features than total reads in the sample. Personally, we have decided to drop the Qualimap results and compute those statistics with an in-house script, and wanted to open the discussion here.

Command used and terminal output

No response

Relevant files

No response

System information

No response

The text was updated successfully, but these errors were encountered:

ctuni added the bug Something isn't working label Mar 21, 2024

drpatelh added this to the 3.15.0 milestone May 13, 2024

drpatelh added documentation and removed bug Something isn't working labels May 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qualimap assigns more reads to exonic regions than total reads in the sample #1273

Qualimap assigns more reads to exonic regions than total reads in the sample #1273

ctuni commented Mar 21, 2024 •

edited

Qualimap assigns more reads to exonic regions than total reads in the sample #1273

Qualimap assigns more reads to exonic regions than total reads in the sample #1273

Comments

ctuni commented Mar 21, 2024 • edited

Description of the bug

Command used and terminal output

Relevant files

System information

ctuni commented Mar 21, 2024 •

edited