
question regarding paper: running on SLURM/AWS Batch #99

Open
linminhtoo opened this issue Feb 1, 2024 · 2 comments

linminhtoo commented Feb 1, 2024

hi again,

I have a question about the exact configuration of your runs for the experiments reported in the paper, which were conducted either on AWS Batch or SLURM.

For these large-scale GWAS runs, do you provide the input files as single files (e.g. one giant phenotype.txt file with N columns for N phenotypes, one giant .vcf.gz file containing all variants, one giant .bed/.bim/.fam triplet), prepare a single config file, and then execute one Nextflow run?

Then, how do you specify the resources for AWS Batch? Since there is no single AWS machine with 450 CPUs, surely the work has to be split across smaller machines (e.g. 32 CPUs per machine). Or, alternatively, do you split the input files by phenotype, so that with 10 phenotypes you have 10 runs, and then execute 10 separate Nextflow runs?
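
For context, this is roughly the kind of single config I have in mind (the parameter names here are just placeholders I made up, not necessarily the actual nf-gwas options):

```groovy
// nextflow.config -- placeholder sketch of the "one config, one run" idea (made-up parameter names)
params {
    project               = 'gwas_all_phenotypes'
    phenotypes_filename   = 'phenotypes.txt'         // single file, N columns for N phenotypes
    genotypes_prediction  = 'arrays.{bim,bed,fam}'   // single PLINK triplet for REGENIE step 1
    genotypes_association = 'imputed.vcf.gz'         // single giant file with all variants?
}
```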

Basically, I am just curious how you specified the total resources and orchestrated the entire workload.

Sorry for the repeated questions; I loved reading your paper, and the code is extremely well written, hence my interest in this work!

Thanks a lot

seppinho (Member) commented Feb 1, 2024

Short answer: we used the Seqera Platform (aka Nextflow Tower) to run the GWAS on AWS Batch. By writing a config file, you can connect a Nextflow pipeline to this system; it does all the magic and provided us with all the metadata regarding CPU hours, costs, etc.
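
The Batch-specific part of such a config is small; roughly something like this (queue, bucket, region and token handling are placeholders, the details depend on your setup):

```groovy
// nextflow.config -- sketch of the AWS Batch / Tower wiring (placeholder values)
workDir = 's3://my-bucket/work'            // shared work directory on S3

process {
    executor = 'awsbatch'                  // every Nextflow process becomes an AWS Batch job
    queue    = 'my-batch-queue'            // Batch job queue backed by a compute environment
}

aws {
    region = 'eu-west-1'
    batch {
        cliPath = '/home/ec2-user/miniconda/bin/aws'  // AWS CLI available in the job AMI
    }
}

tower {
    enabled     = true                     // report the run to Seqera Platform (Nextflow Tower)
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
```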

Input: one phenotype file; the prediction data is also a single file in PLINK format (bim/bed/fam, normally the microarray data); the association files (normally the imputed data) are split by chromosome (but our pipeline can also split them into chunks to increase parallelization).

Nextflow does the magic for us: we only have to specify the max number of CPUs, and Nextflow starts/stops VMs in the background (depending on how many are required by our pipeline). For REGENIE step 2, we split the input chromosomes (normally the imputed data) into chunks and then run all phenotypes for each chunk in parallel, so each chunk runs on all, e.g., 400 phenotypes; REGENIE is designed to handle many phenotypes at once. Each step in our pipeline has a different parallelization level, but REGENIE is computationally the most expensive one, which is why users can split its input into chunks within nf-gwas.
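
To make the parallelization pattern for step 2 concrete, here is a simplified sketch (this is not the actual nf-gwas code, and the regenie call is reduced to the essentials):

```groovy
// Simplified sketch of the REGENIE step 2 fan-out: one task per genomic chunk,
// each task running all phenotypes at once. Not the actual nf-gwas implementation.
process REGENIE_STEP2 {
    cpus 8                                   // per-chunk resources; Batch schedules the jobs on VMs

    input:
    tuple val(chunk_id), path(chunk_bgen)    // one genomic chunk (chromosome or sub-chunk)
    path phenotypes                          // the single phenotype file with all e.g. 400 columns
    path step1_pred                          // predictions list from REGENIE step 1

    output:
    path "chunk_${chunk_id}_*.regenie"       // one result file per phenotype in this chunk

    script:
    """
    regenie --step 2 \
        --bgen ${chunk_bgen} \
        --phenoFile ${phenotypes} \
        --pred ${step1_pred} \
        --bsize 400 \
        --out chunk_${chunk_id}
    """
}

workflow {
    // one element per chromosome; nf-gwas can split further into smaller chunks
    chunks = Channel.of(1..22).map { chr -> tuple(chr, file("imputed_chr${chr}.bgen")) }
    REGENIE_STEP2(chunks, file('phenotypes.txt'), file('regenie_step1_pred.list'))
}
```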

Hope that helps.
Sebastian

linminhtoo (Author) commented Feb 2, 2024

Hey @seppinho, thanks a lot for this. It's very helpful.

Unfortunately, we might not be using Seqera but just plain AWS, so we might not have the ability to simply specify "400 CPUs" and have all the individual machines provisioned magically.

I'm quite curious about how Seqera achieves that. My understanding is that if you specify 400 CPUs and put in a really huge bim/bed/fam triplet, Seqera, in its backend, will start up N machines (say 8 x 50 CPUs), and then, somehow, the main process sees all 400 CPUs as a "single machine" and can run, say, REGENIE step 1 on all 400 CPUs (instead of internally splitting the input data into 8 sub-chunks and running each sub-chunk on its own machine).

Is that an accurate description of what happens on Seqera?

Unfortunately, on the particular AWS service I'm looking at, HealthOmics (no Seqera), I don't think it can do this "magically", so we would have to do some manual splitting of the input data (say, 1000 phenotypes get split into 20 x 50 phenotypes, and then we start 20 different Nextflow runs, each on a 128-CPU machine), roughly along the lines of the sketch below.
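
Concretely, I imagine each of the 20 runs getting its own small config, roughly like this (the parameter names are again just made-up placeholders for whatever nf-gwas expects, and I'm assuming each run is confined to one machine via the local executor):

```groovy
// batch_01.config -- placeholder sketch for one of the 20 manually split runs
params {
    project             = 'gwas_batch_01'
    phenotypes_filename = 'phenotypes.txt'
    phenotypes_columns  = 'pheno_001,pheno_002,...,pheno_050'   // the 50 phenotypes of this batch
    outdir              = 'results/batch_01'
}

executor {
    name = 'local'      // everything runs on the single machine assigned to this run
    cpus = 128          // cap the total CPUs Nextflow hands out to concurrent tasks
}
```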
