Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stringtie prepDE.py script does not take sequence length into consideration #287

Open
JohnHadish opened this issue Feb 2, 2024 · 1 comment
Labels

Comments

@JohnHadish
Copy link
Collaborator

Description of the bug

The prepDE.py script from Stringtie is used by GEMmaker to estimate raw reads for downstream differential expression. It is used in both the Hisat2 and STAR pipelinies. Currently, GEMmaker does not set an -l argument for this script, which is the average length of input reads. The default -l for the prepDE.py script is 75. In the example below, I changed the -l argument to 150 (the proper number for this dataset) and the output is dramatically different (multiplied by 2).

This will have a impact on downstream DE, as this can improperally push reads with low counts above the threshold of detection, and will cause differences to be exagerated. This is a critical Bug.

If I run with default (no -l argument) (default is 75) (this is how GEMmaker currently runs Hisat2 and Star)

TraesCS1A03G0000200,201
TraesCS1A03G0000400,260
TraesCS1A03G0000600,249
TraesCS1A03G0000800,12874
TraesCS1A03G0001000,21
TraesCS1A03G0001200,31
TraesCS1A03G0001300,341
TraesCS1A03G0001400,25
TraesCS1A03G0001500,3093

If I run with 150 as the -l argument (notice that everything is doubled)

TraesCS1A03G0000200,101
TraesCS1A03G0000400,130
TraesCS1A03G0000600,125
TraesCS1A03G0000800,6437
TraesCS1A03G0001000,11
TraesCS1A03G0001200,16
TraesCS1A03G0001300,171
TraesCS1A03G0001400,13
TraesCS1A03G0001500,1547

Command used and terminal output

No response

Relevant files

No response

System information

No response

@JohnHadish JohnHadish added the bug label Feb 2, 2024
@JohnHadish
Copy link
Collaborator Author

Test to see impact on DESeq2 downstream:

Incorrect = -l arguement (75) -- 4295 DEG genes
Correct = -l arguement (150) -- 3944 DEG genes

Overlap between the two groups - 3907 DEG genes

image
Figure 1: DEG which are extra in the Incorrect group (75 group) and DEG which are not included in the Incorrect group which are in the correct group

image
Figure 2: DEG Same, except this time plot base mean instead of padj

The Cutoff used was log2 threshold of at least 5 and padj less than 0.05

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant