Database creation without orthofinder #4

mrouard · 2022-11-10T16:59:13Z

Hello,

I was wondering if it is possible to use SHOOT with an existing database of multiple alignments and trees.
Let's say I reproduce the same directory layout as OrthoFinder and include the DIAMOND databases, the MSAs (FASTA) and the gene trees (Newick): would that be enough to get SHOOT working?

Thank you

guignonv · 2023-06-01T08:55:06Z

It is possible if you provide the following directory structure and make some changes to the files:

ShootDB
    ├── Gene_Trees
    ├── MultipleSequenceAlignments
    ├── Orthogroup_Sequences
    └── WorkingDirectory
        └── Alignments_ids
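
The layout above could be created with a short script. This is only a minimal sketch: the function name `make_shoot_skeleton` is hypothetical, and only the directory names come from the tree above.

```python
from pathlib import Path

# Sub-directories taken from the tree above, as path components.
SUBDIRS = [
    ("Gene_Trees",),
    ("MultipleSequenceAlignments",),
    ("Orthogroup_Sequences",),
    ("WorkingDirectory", "Alignments_ids"),
]


def make_shoot_skeleton(root):
    """Create the empty directory layout under `root` and return it as a Path."""
    root = Path(root)
    for parts in SUBDIRS:
        # parents=True also creates WorkingDirectory and root when needed.
        root.joinpath(*parts).mkdir(parents=True, exist_ok=True)
    return root
```

Usage would be something like `make_shoot_skeleton("ShootDB")`; the files still have to be copied in and renamed as described.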

All your clusters need to be renamed using the OrthoFinder nomenclature scheme: "OG" followed by 7 zero-padded digits, starting from 0 for the first cluster (OG0000000) and incrementing by one for each following cluster. The numbers must match the OrthoMCL output. This means that if you use a different naming scheme, you will need a lookup table to map your cluster names to OG names. You will also have to adjust those names in several places (file contents).
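
As an illustration of the naming scheme, here is a minimal sketch of a lookup table builder. The names `og_name` and `build_lookup` are hypothetical, and it assumes your cluster names are given in the same order as the clusters appear in the OrthoMCL output.

```python
def og_name(index):
    """Return the OG name for a 0-based cluster index, e.g. 0 -> OG0000000."""
    if not 0 <= index <= 9_999_999:
        raise ValueError("cluster index does not fit in 7 digits")
    return f"OG{index:07d}"


def build_lookup(cluster_names):
    """Map each original cluster name to its OG name.

    Assumption: `cluster_names` is ordered like the clustering output,
    so position i corresponds to cluster i.
    """
    return {name: og_name(i) for i, name in enumerate(cluster_names)}
```

For example, `build_lookup(["famA", "famB"])` gives `{"famA": "OG0000000", "famB": "OG0000001"}` (the family names here are made up).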

Gene_Trees should contain all your phylogenetic trees in Newick format (NHX not supported). The name scheme is "OG name" + "_tree.txt".
MultipleSequenceAlignments should contain all the alignments in FASTA format. The name scheme is "OG name" + ".fa".
Orthogroup_Sequences should contain all the cluster sequences in FASTA format. The name scheme is "OG name" + ".fa".
Note: all the sequences of a given family should be present in the cluster FASTA, in its corresponding alignment and in its corresponding tree.
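
A quick way to catch mismatches is to compare the sequence names of a cluster FASTA with those of its alignment. The sketch below is only an illustration: `fasta_ids` and `check_family` are hypothetical helpers, and it assumes the sequence name is the text after ">" up to the first whitespace. The tree is not checked here, since that would need a Newick parser.

```python
from pathlib import Path


def fasta_ids(path):
    """Return the set of sequence names found in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # Assumption: the name is the first token of the header.
                    ids.add(fields[0])
    return ids


def check_family(db_root, og):
    """Compare Orthogroup_Sequences/<og>.fa with MultipleSequenceAlignments/<og>.fa.

    Returns two sorted lists: names missing from the alignment,
    and names missing from the cluster FASTA.
    """
    db_root = Path(db_root)
    seqs = fasta_ids(db_root / "Orthogroup_Sequences" / f"{og}.fa")
    aln = fasta_ids(db_root / "MultipleSequenceAlignments" / f"{og}.fa")
    return sorted(seqs - aln), sorted(aln - seqs)
```

Both returned lists should be empty for every OG before going further.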
WorkingDirectory should contain a set of files:
- SpeciesIDs.txt: contains the list of the species FASTA files in your dataset, one per line, following the format: <species number (starting from 0)>: <species name>.faa. Ex.: "0: arath.faa" on the first line and "1: orysa.faa" on the second.
- SequenceIDs.txt: contains the list of all your dataset sequences, one per line, following the format: <species number>_<species sequence number>: <sequence name>. Ex.: 1_0: LOC_Os01g01050.1
- Species<species number>.fa: where "species number" corresponds to the species number given in the SpeciesIDs.txt file. It contains the species sequences in FASTA format. Sequence names should use the format <species number>_<species sequence number> as described in SequenceIDs.txt.
- clusters_OrthoFinder_I1.5.txt_id_pairs.txt: the OrthoMCL output matrix file ("out..I15" or ".I"). Cluster indexes must correspond to OG numbers. Sequence names (listed for each cluster) must follow the other nomenclature: <species number> + underscore + <species sequence number>, as specified in SequenceIDs.txt. The matrix must contain exactly your clusters, no more, no less (i.e. if you discarded some clusters for your alignments and trees, they need to be removed from the matrix as well!).
- Alignments_ids: contains the same files as MultipleSequenceAlignments but with sequence names using the SequenceIDs.txt nomenclature (<species number>_<species sequence number>).
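
To make the two ID formats concrete, here is a minimal sketch that builds the lines of SpeciesIDs.txt and SequenceIDs.txt from an in-memory mapping. The function names and the input structure (an ordered dict of species name to sequence names) are assumptions made for illustration only.

```python
def species_id_lines(species):
    """Lines for SpeciesIDs.txt: '<species number>: <species name>.faa'."""
    return [f"{i}: {name}.faa" for i, name in enumerate(species)]


def sequence_id_lines(species):
    """Lines for SequenceIDs.txt: '<species number>_<sequence number>: <sequence name>'.

    `species` maps each species name to the list of its sequence names;
    dict order defines the species numbers, list order the sequence numbers.
    """
    lines = []
    for sp_index, seq_names in enumerate(species.values()):
        for seq_index, seq_name in enumerate(seq_names):
            lines.append(f"{sp_index}_{seq_index}: {seq_name}")
    return lines


# Illustrative input; "gene_A" is a made-up sequence name.
example = {"arath": ["gene_A"], "orysa": ["LOC_Os01g01050.1"]}
species_lines = species_id_lines(example)
sequence_lines = sequence_id_lines(example)
```

The same `<species number>_<sequence number>` identifiers are the ones to use as sequence names in the Species<species number>.fa files, in the cluster matrix and in Alignments_ids.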

Then, SHOOT can be used to initialize the "SHOOT database" with these commands:

python shoot/create_shoot_db.py <your "ShootDB" path> full
python shoot/create_shoot_db.py <your "ShootDB" path> profiles
python shoot/bifurcating_trees.py <your "ShootDB" path>
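
If you prefer to script the three steps, a small wrapper could look like the sketch below. The function names are hypothetical; the script paths and arguments are copied from the commands above, and it assumes the wrapper is run from the root of the SHOOT repository so that the relative paths resolve.

```python
import subprocess


def shoot_init_commands(db_path, python="python"):
    """Return the three initialization commands as argument lists, in order."""
    return [
        [python, "shoot/create_shoot_db.py", db_path, "full"],
        [python, "shoot/create_shoot_db.py", db_path, "profiles"],
        [python, "shoot/bifurcating_trees.py", db_path],
    ]


def run_shoot_init(db_path):
    """Run the commands one after the other, stopping at the first failure."""
    for cmd in shoot_init_commands(db_path):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```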
