
combo_prep.py running out of memory #6

Open
tomoosting opened this issue Dec 11, 2020 · 6 comments

Comments

@tomoosting
Hi,

I am trying to run combo_prep.py with 350 whole-genome sequences.
The program keeps running out of memory (400 GB RAM), and I'm pretty much pushing the limits of our system.

Looking online, I see suggestions for batch-processing data into Python.
Is there an option for running this script with batch processing, or would I have to subdivide my linkage groups in order to read in the data?
I was assuming I'd have to analyse all samples for one genomic region in a single analysis.

Any advice on how best to approach an analysis of this many samples is very welcome and much appreciated.
Many thanks,
Tom

@weissman
Contributor

weissman commented Dec 11, 2020 via email

@tomoosting
Author

Hi Daniel,

Will give that a try, thanks for the reply!

Cheers,
Tom

@tomoosting
Author

Hi Daniel,

I’ve reopened this issue as I’ve done some more research into why my runs keep crashing.
I’ve done a number of runs, increasing the number of samples, both with an entire chromosome and with a small section (50 kbp).
In your 2017 publication you mention that memory use should increase linearly with sample size. My resource consumption appears to increase exponentially with sample size (plot attached; x-axis is number of samples, y-axis is memory used in GB).

[attached plot: memory_used_combo_prep]

Even with 50 samples and a 50 kbp section, my resource use reached over 200 GB of RAM, giving me the following error:

Traceback (most recent call last):
File "/nfs/home/oostinto/bin/magic/combo_prep.py", line 181, in
File "/nfs/home/oostinto/bin/magic/combo_prep.py", line 115, in next
File "/nfs/home/oostinto/bin/magic/combo_prep.py", line 83, in addGenotype
MemoryError

I’d like to analyse close to 200 samples in a single run if possible, at least 100. Any idea what might be going on? I've added the 50 kbp files for 50 samples to Google Drive.

I’ve used the following syntax, following the recommended examples:
bcftools mpileup -q 20 -Q 20 -C 50 -r $LG -f $ref $bam_file | bcftools call -c -V indels | bamCaller.py $mean_coverage $out_ext'_'LG$LG.mask.bed.gz | gzip -c > $out_ext'_'LG$LG.vcf.gz

python3 combo_prep.py $in_ext/*'_'$LG.vcf.gz --masks $in_ext/*'_'$LG.mask.bed.gz --coverfile $LG_ext.$region'_cover.txt' > $LG_ext.$region.txt
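(In case the subdivision route helps anyone: a minimal sketch, assuming 1-based inclusive coordinates and a hypothetical linkage-group length, of the window arithmetic for batching. Each (start, end) pair could be passed to bcftools via -r LG:start-end and run through the pipeline above as a separate job; the region names and sizes here are illustrative, not from the actual data.)

```python
def windows(chrom_len, size=50_000):
    """Yield (start, end) 1-based inclusive windows covering a chromosome."""
    for start in range(1, chrom_len + 1, size):
        yield start, min(start + size - 1, chrom_len)

# Region strings suitable for `bcftools ... -r REGION`, for a
# hypothetical 130 kbp linkage group named LG1:
regions = [f"LG1:{s}-{e}" for s, e in windows(130_000)]
```

The per-window outputs would still need to be concatenated (or analysed separately) afterwards; whether combo_prep.py accepts pre-split regions this way is an assumption to verify.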

Many thanks,
Tom

@tomoosting tomoosting reopened this Jan 12, 2021
@weissman
Contributor

Hmm, can you try just running msmc-tools' generate_multihetsep.py on the output from bcftools to check how the memory usage compares? This would help me narrow down what could be causing the problem.

@tomoosting
Author

Memory usage is even higher when I run msmc-tools' generate_multihetsep.py.
X-axis is number of samples and y-axis is GB of RAM used.
The analysis was run on the same linkage group.

[attached plot]

@weissman
Contributor

weissman commented Jan 14, 2021 via email
