Am I reaching the limits of omm-macse? #11

Open
KPHendriks opened this issue Mar 27, 2024 · 3 comments

@KPHendriks

KPHendriks commented Mar 27, 2024

@ranwez et al

Hello everyone,

I am not sure if this is the right place since no other issues have been raised on this git.

I have target capture sequence data for >1,000 nuclear (coding) genes, and created fasta files for ~2,000 samples each.
I have successfully been using the omm-macse pipeline using singularity on our bare metal machine as follows and I have been very happy with the results so far.

./omm_macse_v10.02.sif \
    --in_seq_file ${gene}_contig.fasta \
    --out_dir ${gene}_omm_macse_results \
    --out_file_prefix ${gene} \
    --java_mem 30g

This worked when including up to 1,000 to 1,500 samples in a single fasta file. However, now that my dataset is growing beyond 2,000 samples, omm-macse has a really hard time finishing: a single gene can take many hours or even days. Given the high number of genes I want to process, the total run time becomes too long.

Is there any way to speed up the process, maybe using further arguments for omm-macse?
I already tried increasing --java_mem to 100g, but this does not really seem to help, and I 'only' have 378g of memory, meaning I cannot run multiple genes in parallel (as I used to).

Looking forward to your help. It is much appreciated. :-)

Best wishes,

Kasper

@ranwez
Owner

ranwez commented Mar 28, 2024

Hello,

Yes, it's the right place for questions; 10 other issues have been raised, answered and closed ;)

Trying to align 2000 sequences at once is indeed probably too much for the OMM_MACSE pipeline. Our paper "Aligning protein-coding nucleotide sequences with MACSE" discusses various strategies for handling datasets of different sizes. Section 3.5 focuses on "Aligning thousands of sequences". It starts with: "If you have a very large number of sequences, trying to align them simultaneously is dubious for several technical reasons [16]. It is preferable, as advised by R. Edgar, in the MUSCLE 3.8 [18] user guide (http://www.drive5.com/muscle/muscle_userguide3.8.html), to tackle this problem by leveraging clustering and alignment methods".

In your case, you can split your 2000-sequence dataset into, let's say, 4 datasets of 500 sequences (since you said that the pipeline works fine in your case even for datasets of up to 1000 sequences, 500 should be fine). Then align each dataset with OMM_MACSE; this will produce 4 alignments (ali1_NT.fasta, ali2_NT.fasta, ali3_NT.fasta, ali4_NT.fasta). Finally, merge those alignments by running the alignTwoProfiles subprogram of MACSE three times, e.g.:
java -jar macse.jar -prog alignTwoProfiles -p1 ali1_NT.fasta -p2 ali2_NT.fasta -out_NT ali12_NT.fasta
java -jar macse.jar -prog alignTwoProfiles -p1 ali12_NT.fasta -p2 ali3_NT.fasta -out_NT ali123_NT.fasta
java -jar macse.jar -prog alignTwoProfiles -p1 ali123_NT.fasta -p2 ali4_NT.fasta -out_NT ali_all_NT.fasta
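
The three commands above generalize to any number of partial alignments. Here is a minimal bash sketch, assuming macse.jar is reachable at the path held in MACSE_JAR. The function name merge_profiles, the DRY_RUN switch and the intermediate `.stepN.fasta` file names are illustrative choices, not part of MACSE:

```shell
#!/usr/bin/env bash
# Sketch: chain alignTwoProfiles over any number of partial alignments.
# MACSE_JAR is a placeholder path; DRY_RUN=1 only prints the commands.
MACSE_JAR="${MACSE_JAR:-macse.jar}"

merge_profiles() {
    local out="$1"; shift        # name of the final merged alignment
    local current="$1"; shift    # start from the first partial alignment
    local i=1 next target
    for next in "$@"; do
        # The last merge writes the final output, earlier ones write step files.
        if [ "$i" -eq "$#" ]; then
            target="$out"
        else
            target="${out%.fasta}.step${i}.fasta"
        fi
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo java -jar "$MACSE_JAR" -prog alignTwoProfiles \
                -p1 "$current" -p2 "$next" -out_NT "$target"
        else
            java -jar "$MACSE_JAR" -prog alignTwoProfiles \
                -p1 "$current" -p2 "$next" -out_NT "$target" || return 1
        fi
        current="$target"        # the merged profile becomes -p1 of the next step
        i=$((i + 1))
    done
}
```

Calling `DRY_RUN=1 merge_profiles ali_all_NT.fasta ali1_NT.fasta ali2_NT.fasta ali3_NT.fasta ali4_NT.fasta` prints three commands equivalent to the ones above (with different intermediate file names) without running them, which is a cheap way to check the chain before launching it.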

The four datasets can be obtained using a clustering approach or simply by randomly splitting your dataset into four.
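
For the random option, one possible sketch in bash and awk follows. The function name split_fasta, the `_partN.fasta` naming and the fixed default seed are assumptions made for illustration. Records are shuffled and then dealt round-robin, so chunk sizes differ by at most one sequence:

```shell
#!/usr/bin/env bash
# Sketch: randomly split a FASTA file into n chunks of near-equal size.
# Usage: split_fasta input.fasta n_chunks output_prefix [seed]
split_fasta() {
    local in_fasta="$1" n_chunks="$2" prefix="$3" seed="${4:-42}"
    awk -v n="$n_chunks" -v prefix="$prefix" -v seed="$seed" '
        /^>/        { count++; rec[count] = $0; next }     # a header starts a new record
        count && NF { rec[count] = rec[count] "\n" $0 }    # append sequence lines to it
        END {
            srand(seed)
            # Fisher-Yates shuffle of the records
            for (i = count; i > 1; i--) {
                j = int(rand() * i) + 1
                tmp = rec[i]; rec[i] = rec[j]; rec[j] = tmp
            }
            # deal the shuffled records round-robin over the n chunk files
            for (i = 1; i <= count; i++) {
                file = prefix "_part" ((i - 1) % n + 1) ".fasta"
                print rec[i] > file
            }
        }
    ' "$in_fasta"
}
```

For instance, `split_fasta gene_contig.fasta 4 gene` would write gene_part1.fasta to gene_part4.fasta, each of which can then be aligned separately. Multi-line sequences are kept intact because whole records are moved, not individual lines.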
Of course, the idea is to write a small script that does this automatically for one gene/dataset, so that you can call this script on your 1,000 genes. All datasets and command lines discussed in the paper are available on our website.
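
Calling such a per-gene script on many genes could look like the sketch below. It assumes a hypothetical per-gene script (here called align_one_gene.sh, not part of OMM_MACSE) that takes a gene name as its only argument, a text file with one gene name per line, and an xargs that supports -P (GNU and BSD xargs do):

```shell
#!/usr/bin/env bash
# Sketch: run a per-gene command on every gene in a list, a few at a time.
# Usage: run_all_genes gene_list.txt max_jobs command [args...]
run_all_genes() {
    local gene_list="$1" jobs="$2"; shift 2
    # -I {} passes each line (one gene name) as the last argument of the command;
    # -P keeps at most $jobs invocations running concurrently.
    xargs -P "$jobs" -I {} "$@" {} < "$gene_list"
}
```

For example, `run_all_genes genes.txt 4 ./align_one_gene.sh` keeps at most four genes running at once. With --java_mem 30g on a machine with 378g of memory, the Java heaps alone cap this at 12 concurrent jobs, and leaving some headroom below that seems prudent.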

I hope this will allow you to get high-quality alignments with MACSE. If you have any other questions, do not hesitate to reach out.

Best regards,

Vincent Ranwez

Ranwez V, Chantret N, Delsuc F. Aligning Protein-Coding Nucleotide Sequences with MACSE. Methods Mol Biol. 2021;2231:51-70. doi: 10.1007/978-1-0716-1036-7_4. PMID: 33289886.

@KPHendriks
Author

Dear Vincent,

Many thanks for the suggestion. I will try this approach shortly.
This sounds indeed like the best approach for now. :-)

Best wishes,

Kasper

KPHendriks changed the title from "I am reaching the limits of omm-macse?" to "Am I reaching the limits of omm-macse?" on Mar 28, 2024