
StringTie prepDE.py script needs length argument for calculating raw counts #266

Open
JohnHadish opened this issue Nov 14, 2022 · 1 comment

@JohnHadish (Collaborator)

Description of the bug

Impacts the Hisat2 and STAR aligners.
The prepDE.py script is used to calculate raw counts. It has a parameter, -l (length), which takes the average length of each read and defaults to 75 bp. Currently, GEMmaker does not set this parameter, which results in incorrect raw read counts being reported to the user.

The code for prepDE.py:

http://ccb.jhu.edu/software/stringtie/dl/prepDE.py3

The code that needs to be changed:

https://github.com/SystemsGenetics/GEMmaker/blob/master/modules/local/stringtie_fpkm_tpm.nf

Suggested change:
Pass the read length as an argument to the StringTie module and use it for prepDE.py's -l parameter, as sketched below.
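A minimal sketch of the idea in Python (not the actual Nextflow module; the file names, the 10,000-read sampling cutoff, and the `mean_read_length` helper are hypothetical, and it assumes prepDE.py is on the PATH with sample_lst.txt as its sample list): estimate the average read length from the trimmed FASTQ and forward it to prepDE.py's -l option rather than relying on the 75 bp default.

```python
import gzip
import subprocess

def mean_read_length(fastq_path, max_reads=10000):
    """Estimate the mean read length from the first max_reads of a FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    total, n = 0, 0
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # in FASTQ, every 4th line starting at line 2 is a sequence
                total += len(line.strip())
                n += 1
                if n >= max_reads:
                    break
    return round(total / n) if n else 75  # fall back to prepDE.py's own default

read_len = mean_read_length("sample_1.trimmed.fastq.gz")
subprocess.run(["prepDE.py", "-i", "sample_lst.txt", "-l", str(read_len)], check=True)
```

In GEMmaker itself, the equivalent would be wiring a read-length value through the stringtie_fpkm_tpm.nf process and appending -l to the prepDE.py command line.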

Command used and terminal output

No response

Relevant files

No response

System information

No response

@JohnHadish added the bug label Nov 14, 2022
@spficklin (Member)

This is a problem and should be fixed, but I don't think it means GEMmaker is delivering counts that are not "correct". Even if the average read length were given properly (rather than the 75 bp default), the count would still, technically, not be "correct": it's an estimate using the formula reads_per_transcript = coverage * transcript_len / read_len. For example, trimmed reads can still be shorter or longer than the average, so the count won't be perfect.
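To make that concrete with illustrative numbers (nothing below comes from a real dataset): because read_len is a divisor in the formula, leaving the 75 bp default in place on, say, 150 bp reads inflates every count by the same constant factor.

```python
# Illustrative numbers only: the same coverage and transcript length,
# counted with the true read length vs. prepDE.py's 75 bp default.
coverage, transcript_len = 40.0, 1500

true_count = coverage * transcript_len / 150     # actual 150 bp reads -> 400.0
default_count = coverage * transcript_len / 75   # assumed 75 bp default -> 800.0

# The error is a constant per-sample scale factor (150/75 = 2x here),
# not random per-gene noise.
print(true_count, default_count)
```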

I think the problem isn't that the counts are incorrect; it's more the challenge of comparing across samples that have different average read lengths.

If your input files are all from the same experiment, odds are the read lengths across the samples will be the same -- no problem.

If your input files are from samples that have different read lengths, then cross-sample normalization, I think, should clear things up (see the toy example below). Folks with larger datasets should be checking for this and normalizing.

If your sample file is absolutely huge, then it's hard to normalize and there may be some comparison problems. But correlation analysis should still be okay (with some added noise).
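A toy illustration of the normalization point (plain NumPy, not anything GEMmaker runs; the numbers are made up): a constant per-sample inflation factor, like the one introduced by a wrong read length, cancels out under simple library-size scaling such as counts-per-million.

```python
import numpy as np

# Sample B is the same library as sample A, but its counts are uniformly
# 2x inflated by an incorrect read-length assumption.
counts = np.array([[100.0, 200.0, 700.0],     # sample A (correct read length)
                   [200.0, 400.0, 1400.0]])   # sample B (counts 2x inflated)

# Counts-per-million scaling removes the constant per-sample factor entirely.
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
print(cpm)  # both rows are identical after scaling
```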
