Am I reaching the limits of omm-macse? #11
Comments
Hello, yes, this is the right place for questions; 10 other issues have been raised, answered, and closed ;)
Trying to align 2000 sequences is indeed probably too much for the OMM_MACSE pipeline. Our paper "Aligning protein-coding nucleotide sequences with MACSE" discusses various strategies for handling datasets of different sizes. Section 3.5 focuses on "Aligning thousands of sequences". It starts with: "If you have a very large number of sequences, trying to align them simultaneously is dubious for several technical reasons [16]. It is preferable, as advised by R. Edgar, in the MUSCLE 3.8 [18] user guide (http://www.drive5.com/muscle/muscle_userguide3.8.html), to tackle this problem by leveraging clustering and alignment methods".
In your case, you can split your 2000-sequence dataset into, let's say, 4 datasets of 500 sequences (since you said that the pipeline works fine in your case even for datasets of up to 1000 sequences, 500 should be fine). Then align each dataset with OMM_MACSE; this will produce 4 alignments (ali1_NT.fasta, ali2_NT.fasta, ali3_NT.fasta, ali4_NT.fasta). Finally, merge those alignments by running the alignTwoProfiles subprogram of MACSE three times.
The four datasets can be obtained using a clustering approach or just by randomly splitting your dataset into four.
I hope this will allow you to get high-quality alignments with MACSE. If you have any other questions, do not hesitate to reach out.
Best regards,
Vincent Ranwez
Ranwez V, Chantret N, Delsuc F. Aligning Protein-Coding Nucleotide Sequences with MACSE. Methods Mol Biol. 2021;2231:51-70. doi: 10.1007/978-1-0716-1036-7_4. PMID: 33289886.
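The split, align, and merge steps suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the maintainers' own script: the function names, the `part`/`ali` file names, and the `MACSE_JAR` variable are hypothetical, and the `alignTwoProfiles` options (`-p1`, `-p2`, `-out_NT`, `-out_AA`) should be checked against the MACSE documentation for your version. The split is a simple round-robin deal of records; shuffle the records first, or cluster them, if you want a random or similarity-based split.

```shell
#!/usr/bin/env bash
# Sketch of the split / align / merge workflow described above.
# Assumptions (not from the thread): the function names, the part*/ali*
# file names, and the MACSE_JAR variable are hypothetical placeholders.

# Deal FASTA records round-robin into n chunks: <prefix>1.fasta ... <prefix>n.fasta
# Usage: split_fasta input.fasta n_chunks out_prefix
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/  { chunk = (count++ % n) + 1 }        # header line: move to the next chunk
        chunk { print > (prefix chunk ".fasta") }   # copy header and sequence lines
    ' "$1"
}

# Merge two alignments with the alignTwoProfiles subprogram of MACSE.
# The option names below are an assumption to verify against the MACSE docs.
# Usage: merge_profiles a_NT.fasta b_NT.fasta out_prefix
merge_profiles() {
    java -jar "$MACSE_JAR" -prog alignTwoProfiles \
        -p1 "$1" -p2 "$2" \
        -out_NT "${3}_NT.fasta" -out_AA "${3}_AA.fasta"
}

# Example for one gene (commented out so that sourcing this file runs nothing):
# split_fasta "${gene}_contig.fasta" 4 "${gene}_part"
# ... run OMM_MACSE on each ${gene}_partN.fasta to obtain ali1_NT.fasta ... ali4_NT.fasta ...
# merge_profiles ali1_NT.fasta ali2_NT.fasta ali12
# merge_profiles ali3_NT.fasta ali4_NT.fasta ali34
# merge_profiles ali12_NT.fasta ali34_NT.fasta ali_all
```

Merging pairwise (1+2, 3+4, then the two results) takes the three alignTwoProfiles runs mentioned above; merging sequentially (1+2, then +3, then +4) would need the same number of runs.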
Dear Vincent,
Many thanks for the suggestion. I will try this approach shortly.
Best wishes,
Kasper
@ranwez et al
Hello everyone,
I am not sure if this is the right place since no other issues have been raised on this git.
I have target capture sequence data for >1,000 nuclear (coding) genes, and created fasta files for ~2,000 samples each.
I have been successfully running the omm-macse pipeline with Singularity on our bare-metal machine as follows, and I have been very happy with the results so far:
./omm_macse_v10.02.sif \
    --in_seq_file ${gene}_contig.fasta \
    --out_dir ${gene}_omm_macse_results \
    --out_file_prefix ${gene} \
    --java_mem 30g
This worked with up to 1,000 to 1,500 samples in a single fasta file. However, now that my dataset is growing beyond 2,000 samples, omm-macse struggles to finish: a single gene can take many hours or even days. Given the number of genes I want to process, the total run time becomes impractical.
Is there any way to speed up the process, maybe using further arguments for omm-macse?
I already tried increasing the memory to 100g, but this does not really seem to help, and I 'only' have 378g in total, meaning I cannot run multiple genes in parallel (as I used to).
Looking forward to your help. It is much appreciated. :-)
Best wishes,
Kasper