
combo_prep.py running out of memory #6

Open
tomoosting opened this issue Dec 11, 2020 · 6 comments

Comments

@tomoosting
Hi,

I am trying to run combo_prep.py with 350 whole-genome sequences.
The program keeps running out of memory (400 GB RAM), and I'm pretty much pushing the limits of our system.

Looking online, I see suggestions for batch-processing data into Python.
Is there an option for running this script with batch processing, or would I have to subdivide my linkage groups in order to read in the data?
I was assuming I'd have to analyse all samples for one genomic region in a single analysis.

Any advice on how best to approach an analysis of this many samples is very welcome and much appreciated.
Many thanks,
Tom

@weissman
Contributor

weissman commented Dec 11, 2020 via email

@tomoosting
Author

Hi Daniel,

Will give that a try, thanks for the reply!

Cheers,
Tom

@tomoosting
Author

Hi Daniel,

I’ve reopened this issue as I’ve done some more research into why my runs keep crashing.
I’ve done a number of runs, increasing the number of samples, both with an entire chromosome and with a small section (50 kbp).
In your 2017 publication you mention that memory use should increase linearly with sample size. My resource consumption appears to increase exponentially with sample size (plot attached; x-axis is number of samples, y-axis is memory used in GB).

[attached plot: memory_used_combo_prep]

Even with 50 samples and a 50 kbp section, my resource use reached over 200 GB of RAM, giving me the following error:

Traceback (most recent call last):
File "/nfs/home/oostinto/bin/magic/combo_prep.py", line 181, in
File "/nfs/home/oostinto/bin/magic/combo_prep.py", line 115, in next
File "/nfs/home/oostinto/bin/magic/combo_prep.py", line 83, in addGenotype
MemoryError

I’d like to analyse close to 200 samples in a single run if possible, at least 100. Any idea what might be going on? I've added the 50 kbp files for 50 samples to Google Drive.

I’ve used the following syntax, following the recommended examples:
bcftools mpileup -q 20 -Q 20 -C 50 -r $LG -f $ref $bam_file | bcftools call -c -V indels | bamCaller.py $mean_coverage $out_ext'_'LG$LG.mask.bed.gz | gzip -c > $out_ext'_'LG$LG.vcf.gz

python3 combo_prep.py $in_ext/*'_'$LG.vcf.gz --masks $in_ext/*'_'$LG.mask.bed.gz --coverfile $LG_ext.$region'_cover.txt' > $LG_ext.$region.txt
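(In case the subdivision route helps anyone: a minimal sketch, assuming 1-based inclusive coordinates and a hypothetical linkage-group length, of the window arithmetic for batching. Each (start, end) pair could be passed to bcftools via -r LG:start-end and run through the pipeline above as a separate job; the region names and sizes here are illustrative, not from the actual data.)

```python
def windows(chrom_len, size=50_000):
    """Yield (start, end) 1-based inclusive windows covering a chromosome."""
    for start in range(1, chrom_len + 1, size):
        yield start, min(start + size - 1, chrom_len)

# Region strings suitable for `bcftools ... -r REGION`, for a
# hypothetical 130 kbp linkage group named LG1:
regions = [f"LG1:{s}-{e}" for s, e in windows(130_000)]
```

The per-window outputs would still need to be concatenated (or analysed separately) afterwards; whether combo_prep.py accepts pre-split regions this way is an assumption to verify.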

Many thanks,
Tom

@tomoosting tomoosting reopened this Jan 12, 2021
@weissman
Contributor

Hmm, can you try just running msmc-tools' generate_multihetsep.py on the output from bcftools to check how the memory usage compares? This would help me narrow down what could be causing the problem.

@tomoosting
Author

Memory usage is even higher when I run msmc-tools' generate_multihetsep.py.
X-axis is number of samples and y-axis is GB of RAM used.
The analysis was run on the same linkage group.

[attached plot]

@weissman
Contributor

weissman commented Jan 14, 2021 via email
