optimise VariantSpark for large sample size (n>50K) #204

Open
natwine opened this issue Sep 24, 2021 · 1 comment
Comments

natwine commented Sep 24, 2021

VariantSpark is currently optimised for reasonably small sample sizes (n=100-5000) and large numbers of variants (e.g. 42 million), i.e. 'wide' datasets. When working on phenotypes in UKBB, e.g. CAD, we have sample sizes of ~50K at our disposal, and VariantSpark has a long run time (~3 days) at such sample sizes. As we expect genomic cohorts to grow in size, it is worth considering how we can optimise VariantSpark for larger sample sizes (50K plus).

@DavidB-XI

This work is inspiring, great method and deployment model!

One option is to summarise the samples in a reduced dimension, fit the model in that reduced sample space, and then apply the learned parameters to predict the original variable.

This could be useful for VariantSpark, which handles millions of features but slows down with tens of thousands of samples.

If you like this idea, I've implemented a method that finds an encoding of the sample space, reduces the samples enough to carry out a faster and more efficient regression, and then unfolds the prediction to make it seem as though it ran on the full sample space.

You can find this method here:
https://github.com/AskExplain/summary_sampling_via_folding/blob/main/prediction_using_fold_sampling.pdf
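
To make the reduce-then-regress idea concrete, here is a minimal sketch in plain Python. It is not the method from the linked PDF and not part of VariantSpark: as a stand-in for the encoding step it assumes a CountSketch-style random projection, which folds each sample into one of `k` buckets with a random sign. The function names (`count_sketch`, `least_squares`, `predict`), the synthetic data, and all parameter values are hypothetical and chosen only for illustration.

```python
import random

def count_sketch(X, y, k, seed=0):
    """Fold n samples into k buckets (assumed stand-in for the encoding step).

    Each sample is added to one randomly chosen bucket with a random sign,
    so the regression afterwards runs on k rows instead of n.
    """
    rng = random.Random(seed)
    p = len(X[0])
    SX = [[0.0] * p for _ in range(k)]
    Sy = [0.0] * k
    for row, target in zip(X, y):
        bucket = rng.randrange(k)
        sign = rng.choice((-1.0, 1.0))
        for j in range(p):
            SX[bucket][j] += sign * row[j]
        Sy[bucket] += sign * target
    return SX, Sy

def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))]
         for i in range(p)]
    for c in range(p):
        # Partial pivoting: move the largest entry in column c onto the diagonal
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(X, beta):
    """Apply coefficients learned in the reduced space to the full sample set."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

# Synthetic data (hypothetical): 5000 samples, 3 features, small Gaussian noise
rng = random.Random(42)
true_beta = [1.5, -2.0, 0.5]
X = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(5000)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0, 0.1) for row in X]

SX, Sy = count_sketch(X, y, k=200)   # 5000 samples folded into 200 rows
beta_hat = least_squares(SX, Sy)     # regression in the reduced sample space
y_hat = predict(X, beta_hat)         # "unfold": predict for every original sample
```

The sketch only reduces the number of samples; the number of features is unchanged, so it says nothing about how such a step would interact with VariantSpark's own model or with genotype data in practice.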

I tried to run it on the sample 1000 Genomes dataset, but ran into errors installing the library itself on my local machine, so unfortunately I can't apply this idea with VariantSpark myself.

If you need help adapting the code to work with CSV / VCF files, let me know in this issue thread. If it works, let me know here too; it would be great to work on this with the team!
