Am I reaching the limits of omm-macse? #11

KPHendriks · 2024-03-27T19:05:37Z

Hello everyone,

I am not sure if this is the right place, since no other issues have been raised in this repository.

I have target capture sequence data for >1,000 nuclear (coding) genes, and created fasta files for ~2,000 samples each.
I have been running the omm-macse pipeline with Singularity on our bare-metal machine as follows, and I have been very happy with the results so far.

./omm_macse_v10.02.sif \
    --in_seq_file ${gene}_contig.fasta \
    --out_dir ${gene}_omm_macse_results \
    --out_file_prefix ${gene} \
    --java_mem 30g

This worked with up to 1,000 to 1,500 samples in a single fasta file. However, now that my dataset is growing beyond 2,000 samples, omm-macse has a really hard time finishing: a single gene can take many hours or even days. Given the high number of genes I want to process, the total run time is becoming too long.

Is there any way to speed up the process, maybe using further arguments for omm-macse?
I already tried increasing --java_mem to 100g, but this does not really seem to help, and I 'only' have 378g of memory, meaning I cannot run multiple genes in parallel (as I used to).

Looking forward to your help. It is much appreciated. :-)

Best wishes,

Kasper

ranwez · 2024-03-28T10:43:03Z

Hello,

Yes, it's the right place for questions; 10 other issues have been raised, answered and closed ;)

Trying to align 2,000 sequences is indeed probably too much for the OMM_MACSE pipeline. Our paper "Aligning protein-coding nucleotide sequences with MACSE" discusses various strategies for handling datasets of different sizes. Section 3.5 focuses on "Aligning thousands of sequences". It starts with: "If you have a very large number of sequences, trying to align them simultaneously is dubious for several technical reasons [16]. It is preferable, as advised by R. Edgar, in the MUSCLE 3.8 [18] user guide (http://www.drive5.com/muscle/muscle_userguide3.8.html), to tackle this problem by leveraging clustering and alignment methods"

In your case, you can split your 2,000-sequence dataset into, let's say, 4 datasets of 500 sequences (since you said that the pipeline works fine in your case even for datasets of up to 1,000 sequences, 500 should be fine). Then align each dataset with OMM_MACSE. This will produce 4 alignments (ali1_NT.fasta, ali2_NT.fasta, ali3_NT.fasta, ali4_NT.fasta). Finally, merge those alignments by running the alignTwoProfiles subprogram of macse three times, e.g.:
java -jar macse.jar -prog alignTwoProfiles -p1 ali1_NT.fasta -p2 ali2_NT.fasta -out_NT ali12_NT.fasta
java -jar macse.jar -prog alignTwoProfiles -p1 ali12_NT.fasta -p2 ali3_NT.fasta -out_NT ali123_NT.fasta
java -jar macse.jar -prog alignTwoProfiles -p1 ali123_NT.fasta -p2 ali4_NT.fasta -out_NT ali_all_NT.fasta
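
With more than four subsets, the same chain of calls can be generated rather than typed by hand. The following is only a minimal shell sketch under stated assumptions: the function name print_merge_commands and the intermediate file names merged_stepN_NT.fasta are hypothetical, and only the alignTwoProfiles options shown above (-p1, -p2, -out_NT) are used. It prints the commands instead of running them, so they can be checked first.

```shell
# Print (do not run) the chain of alignTwoProfiles calls that merges
# N alignments one after another, as in the three commands above.
# Usage: print_merge_commands FINAL_NT.fasta ali1_NT.fasta ali2_NT.fasta ...
# Intermediate names (merged_stepN_NT.fasta) are hypothetical.
print_merge_commands() {
    local final=$1; shift      # name of the final merged alignment
    local current=$1; shift    # profile accumulated so far
    local step=1 out
    while [ "$#" -gt 0 ]; do
        if [ "$#" -eq 1 ]; then
            out=$final         # last merge writes the final file
        else
            out="merged_step${step}_NT.fasta"
        fi
        echo "java -jar macse.jar -prog alignTwoProfiles -p1 $current -p2 $1 -out_NT $out"
        current=$out
        shift
        step=$((step + 1))
    done
}
```

For example, `print_merge_commands ali_all_NT.fasta ali1_NT.fasta ali2_NT.fasta ali3_NT.fasta ali4_NT.fasta` prints three commands, and the output can be piped to `sh` once it looks right.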

The four datasets can be obtained using a clustering approach or just by randomly splitting your dataset into four.
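
One simple way to get such subsets without clustering is sketched below. It is an illustration under assumptions, not part of MACSE: the function name split_fasta and the output names PREFIX_partK.fasta are hypothetical, and the split is round-robin over the input order rather than truly random, so shuffle the records first if the order of your file is meaningful.

```shell
# Split a FASTA file into N subsets by assigning records round-robin.
# Usage: split_fasta IN.fasta N PREFIX
# Writes PREFIX_part1.fasta ... PREFIX_partN.fasta (hypothetical names).
split_fasta() {
    local input=$1 n=$2 prefix=$3
    awk -v n="$n" -v prefix="$prefix" '
        # A header line starts a new record: pick the next subset in turn.
        /^>/ { count++; out = sprintf("%s_part%d.fasta", prefix, (count - 1) % n + 1) }
        # Write header and sequence lines to the current subset;
        # anything before the first header is skipped.
        out != "" { print > out }
    ' "$input"
}
```

For example, `split_fasta ${gene}_contig.fasta 4 ${gene}` would write four files of about equal size, each of which can then be given to OMM_MACSE through --in_seq_file.
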
Of course, the idea is to write a small script that does this automatically for one gene/dataset, so that you can call this script on your 1,000 genes. All datasets and command lines discussed in the paper are available on our website.
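
As a rough sketch of the "call it on every gene" part, assuming the input files follow the ${gene}_contig.fasta naming used earlier in this thread: the function name list_genes and the script name align_one_gene.sh in the usage note are hypothetical placeholders for whatever per-gene script you write.

```shell
# List gene names found as DIR/<gene>_contig.fasta, one per line.
# Usage: list_genes DIR
list_genes() {
    local dir=$1 f name
    for f in "$dir"/*_contig.fasta; do
        [ -e "$f" ] || continue        # no match: the glob stays literal
        name=${f##*/}                  # strip the directory part
        echo "${name%_contig.fasta}"   # strip the suffix to get the gene name
    done
}
```

It could then drive the per-gene script with something like `list_genes . | while read -r gene; do ./align_one_gene.sh "$gene"; done`.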

I hope this will allow you to get high-quality alignments with macse. If you have any other questions, do not hesitate to reach out.

Best regards,

Vincent Ranwez

Ranwez V, Chantret N, Delsuc F. Aligning Protein-Coding Nucleotide Sequences with MACSE. Methods Mol Biol. 2021;2231:51-70. doi: 10.1007/978-1-0716-1036-7_4. PMID: 33289886.

KPHendriks · 2024-03-28T10:51:36Z

Dear Vincent,

Many thanks for the suggestion. I will try this approach shortly.
This sounds indeed like the best approach for now. :-)

Best wishes,

Kasper

KPHendriks changed the title ~~I am reaching the limits of omm-macse?~~ Am I reaching the limits of omm-macse? Mar 28, 2024
