
question regarding paper: running on SLURM/AWS Batch #99

Open
linminhtoo opened this issue Feb 1, 2024 · 2 comments

linminhtoo commented Feb 1, 2024

hi again,

I have a question about the exact configuration of your runs for the experiments reported in the paper, which were conducted either on AWS Batch or SLURM.

For these large-scale GWAS runs, do you provide the input files as single files (e.g. one giant phenotype.txt file with N columns for N phenotypes, one giant .vcf.gz file containing all variants, one giant .bed/.bim/.fam triplet), prepare a single config file, and then execute one Nextflow run?

Then, how do you specify the resources for AWS Batch? Since there is no single AWS machine with 450 CPUs, surely the work has to be split across smaller machines (e.g. 32 CPUs per machine). Or, alternatively, do you split the input files by phenotype, so that with 10 phenotypes you have 10 runs, and then execute 10 separate Nextflow runs?
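
For context, this is roughly the kind of single config I have in mind (the parameter names here are just placeholders I made up, not necessarily the actual nf-gwas options):

```groovy
// nextflow.config -- placeholder sketch of the "one config, one run" idea (made-up parameter names)
params {
    project               = 'gwas_all_phenotypes'
    phenotypes_filename   = 'phenotypes.txt'         // single file, N columns for N phenotypes
    genotypes_prediction  = 'arrays.{bim,bed,fam}'   // single PLINK triplet for REGENIE step 1
    genotypes_association = 'imputed.vcf.gz'         // single giant file with all variants?
}
```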

Basically, I am just curious how you specified the total resources and orchestrated the entire workload.

Sorry for the repeated questions; I loved reading your paper, and the code is extremely well written, hence my interest in this work!

Thanks a lot

seppinho (Member) commented Feb 1, 2024

Short answer: we used the Seqera Platform (aka Nextflow Tower) to run the GWAS on AWS Batch. By writing a config file, you can connect a Nextflow pipeline to this system; it does all the magic and provided us with all the metadata regarding CPU hours, costs, etc.
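
The Batch-specific part of such a config is small; roughly something like this (queue, bucket, region and token handling are placeholders, the details depend on your setup):

```groovy
// nextflow.config -- sketch of the AWS Batch / Tower wiring (placeholder values)
workDir = 's3://my-bucket/work'            // shared work directory on S3

process {
    executor = 'awsbatch'                  // every Nextflow process becomes an AWS Batch job
    queue    = 'my-batch-queue'            // Batch job queue backed by a compute environment
}

aws {
    region = 'eu-west-1'
    batch {
        cliPath = '/home/ec2-user/miniconda/bin/aws'  // AWS CLI available in the job AMI
    }
}

tower {
    enabled     = true                     // report the run to Seqera Platform (Nextflow Tower)
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
```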

Input: one phenotype file; the prediction data is also a single file in PLINK format (bim/bed/fam, normally the microarray data); the association files (normally the imputed data) are split by chromosome (but our pipeline can also split them into chunks to increase parallelization).

Nextflow does the magic for us: we only have to specify the max number of CPUs, and Nextflow starts/stops VMs in the background (depending on how many are required by our pipeline). For REGENIE step 2, we split the input chromosomes (normally the imputed data) into chunks and then run all phenotypes for each chunk in parallel, so each chunk runs on all, e.g., 400 phenotypes; REGENIE is designed to handle many phenotypes at once. Each step in our pipeline has a different parallelization level, but REGENIE is computationally the most expensive one, which is why users can split its input into chunks within nf-gwas.
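
To make the parallelization pattern for step 2 concrete, here is a simplified sketch (this is not the actual nf-gwas code, and the regenie call is reduced to the essentials):

```groovy
// Simplified sketch of the REGENIE step 2 fan-out: one task per genomic chunk,
// each task running all phenotypes at once. Not the actual nf-gwas implementation.
process REGENIE_STEP2 {
    cpus 8                                   // per-chunk resources; Batch schedules the jobs on VMs

    input:
    tuple val(chunk_id), path(chunk_bgen)    // one genomic chunk (chromosome or sub-chunk)
    path phenotypes                          // the single phenotype file with all e.g. 400 columns
    path step1_pred                          // predictions list from REGENIE step 1

    output:
    path "chunk_${chunk_id}_*.regenie"       // one result file per phenotype in this chunk

    script:
    """
    regenie --step 2 \
        --bgen ${chunk_bgen} \
        --phenoFile ${phenotypes} \
        --pred ${step1_pred} \
        --bsize 400 \
        --out chunk_${chunk_id}
    """
}

workflow {
    // one element per chromosome; nf-gwas can split further into smaller chunks
    chunks = Channel.of(1..22).map { chr -> tuple(chr, file("imputed_chr${chr}.bgen")) }
    REGENIE_STEP2(chunks, file('phenotypes.txt'), file('regenie_step1_pred.list'))
}
```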

Hope that helps.
Sebastian

linminhtoo (Author) commented Feb 2, 2024

Hey @seppinho, thanks a lot for this. It's very helpful.

Unfortunately, we might not be using Seqera but just plain AWS, so we might not have the ability to simply specify "400 CPUs" and have all the individual machines provisioned magically.

I'm quite curious about how Seqera achieves that. My understanding is that if you specify 400 CPUs and put in a really huge bim/bed/fam triplet, Seqera, in its backend, will start up N machines (say 8 x 50 CPUs), and then, somehow, the main process sees all 400 CPUs as a "single machine" and can run, say, REGENIE step 1 on all 400 CPUs (instead of internally splitting the input data into 8 sub-chunks and running each sub-chunk on its own machine).

Is that an accurate description of what happens on Seqera?

Unfortunately, on the particular AWS service I'm looking at, HealthOmics (no Seqera), I don't think it can do this "magically", so we would have to do some manual splitting of the input data (say, 1000 phenotypes get split into 20 x 50 phenotypes, and then we start 20 different Nextflow runs, each on a 128-CPU machine), roughly along the lines of the sketch below.
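
Concretely, I imagine each of the 20 runs getting its own small config, roughly like this (the parameter names are again just made-up placeholders for whatever nf-gwas expects, and I'm assuming each run is confined to one machine via the local executor):

```groovy
// batch_01.config -- placeholder sketch for one of the 20 manually split runs
params {
    project             = 'gwas_batch_01'
    phenotypes_filename = 'phenotypes.txt'
    phenotypes_columns  = 'pheno_001,pheno_002,...,pheno_050'   // the 50 phenotypes of this batch
    outdir              = 'results/batch_01'
}

executor {
    name = 'local'      // everything runs on the single machine assigned to this run
    cpus = 128          // cap the total CPUs Nextflow hands out to concurrent tasks
}
```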
