Number of significant genes different with several runs of Pyseer #264

Open
Samriddhi0906 opened this issue Feb 29, 2024 · 2 comments

@Samriddhi0906

Samriddhi0906 commented Feb 29, 2024

After running Pyseer using

pyseer --phenotypes phenotypes.tsv --pres gene_presence_absence.Rtab --similarity phylogeny_similarity.tsv --lmm --covariates covariates.tsv --use-covariates 2 --cpu 8 > $1

and then filtering for significant genes using lrt-pvalue < 0.05, the number of significant genes varies between pyseer runs even though none of the input files have changed.

In total I ran pyseer 7 times with covariates. Across these runs the number of significant genes ranges from 1245 to 1395, and no two runs give the same number.

The expectation would be that each run has the same number of significant genes. When filtering for filter-pvalue <0.05 the number of significant genes is constant.

Additionally, the number of significant genes with covariates is about twice the number without covariates when filtering on lrt-pvalue. When filtering on filter-pvalue, the two counts are the same.

Could you help me understand whether this behaviour is expected when running pyseer? Thanks in advance.
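For readers who want to reproduce the comparison described above, here is a minimal sketch of counting the variants below a p-value cutoff in one output table and listing the variants that are significant in one run but not another. It assumes a tab-separated table with a `variant` column next to the `lrt-pvalue` and `filter-pvalue` columns mentioned in this issue. The function name and the toy tables are hypothetical and are not part of pyseer.

```python
import csv
import io


def significant_variants(tsv_text, column="lrt-pvalue", threshold=0.05):
    """Return the set of variants whose p-value in `column` is below `threshold`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = set()
    for row in reader:
        try:
            p_value = float(row.get(column, ""))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric entries
        if p_value < threshold:
            hits.add(row["variant"])
    return hits


# Hypothetical toy tables standing in for two runs on identical inputs.
header = "variant\taf\tfilter-pvalue\tlrt-pvalue\n"
run_a = header + "geneA\t0.2\t0.01\t0.04\n" + "geneB\t0.5\t0.3\t0.20\n" + "geneC\t0.1\t0.001\t0.001\n"
run_b = header + "geneA\t0.2\t0.01\t0.06\n" + "geneB\t0.5\t0.3\t0.20\n" + "geneC\t0.1\t0.001\t0.001\n"

sig_a = significant_variants(run_a)
sig_b = significant_variants(run_b)

# Variants that are significant in exactly one of the two runs.
unstable = sig_a ^ sig_b
print(len(sig_a), len(sig_b), sorted(unstable))
```

Listing the variants in `unstable` shows whether the differing counts come from a few variants sitting close to the cutoff or from a broader change in the results.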

@mgalardini
Owner

That comes as a bit of a surprise, and it is not what we see in our unit tests, which return the same results every time. One thing I can think of is some stochasticity introduced when using multiple cores. Do you see the same variability when using a single core?

As an aside, a p-value threshold of 0.05 is likely too high; please refer to the docs for suggestions on setting this threshold.

@mgalardini mgalardini self-assigned this Feb 29, 2024
@Samriddhi0906
Author

Thanks for your response. I did run it three times with 1 CPU and I still get variable results.
wicovariates_cpu1_1.tsv: 6268
wicovariates_cpu1_2.tsv: 6345
wicovariates_cpu1_3.tsv: 6357
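Counts alone do not show how far the p-values move between runs. A rough sketch of one way to check this is to rank the variants shared by two runs by the absolute change in their p-value. It assumes the p-values have already been loaded into dictionaries keyed by variant name. The function name and the toy values are hypothetical.

```python
def largest_pvalue_shifts(run_a, run_b, top=3):
    """Rank variants present in both runs by absolute change in p-value."""
    shared = run_a.keys() & run_b.keys()
    shifts = [(variant, abs(run_a[variant] - run_b[variant])) for variant in shared]
    # Largest shift first; ties are broken by variant name for a stable order.
    shifts.sort(key=lambda item: (-item[1], item[0]))
    return shifts[:top]


# Hypothetical lrt-pvalue values from two runs on the same inputs.
run_1 = {"geneA": 0.04, "geneB": 0.20, "geneC": 0.001}
run_2 = {"geneA": 0.06, "geneB": 0.20, "geneC": 0.001}

for variant, shift in largest_pvalue_shifts(run_1, run_2):
    print(variant, shift)
```

Small shifts concentrated near the cutoff would point to numerical noise, while large shifts would suggest that the runs are not fitting the same model.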

As for the p-value threshold, I only use 0.05 for filtering and comparison, to see whether I am getting variable results between runs. For my analysis, I correct for multiple testing before taking any further steps.
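The comment above does not say which correction is applied. As an illustration only, the sketch below assumes a Bonferroni correction, where the significance level is divided by the number of tests. The function name, the number of tests, and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Return the per-test p-value cutoff for a family-wise level `alpha`."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Hypothetical numbers: 0.05 family-wise level, 1000 tested variants.
cutoff = bonferroni_threshold(0.05, 1000)

# Hypothetical p-values; only those below the corrected cutoff are kept.
p_values = {"geneA": 0.04, "geneB": 0.00001, "geneC": 0.001}
kept = sorted(name for name, p in p_values.items() if p < cutoff)
print(cutoff, kept)
```

With a cutoff this much stricter than 0.05, variants whose p-values drift around 0.05 between runs would no longer affect the final list.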
