Database creation without orthofinder #4

Open
mrouard opened this issue Nov 10, 2022 · 1 comment

mrouard commented Nov 10, 2022

Hello,

I was wondering whether it is possible to use SHOOT with an existing database of multiple sequence alignments and trees.
Let's say that I reproduce the same directories as OrthoFinder and include the DIAMOND databases, the MSAs (FASTA) and the gene trees (Newick): would that be enough to get SHOOT working?

Thank you

guignonv commented Jun 1, 2023

It is possible if you provide the following directory structure and make some changes to the files:

ShootDB
    ├── Gene_Trees
    ├── MultipleSequenceAlignments
    ├── Orthogroup_Sequences
    └── WorkingDirectory
        └── Alignments_ids
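
As a starting point, the empty layout above could be created with a few lines of Python. This is only a sketch: the function name `make_shoot_db_skeleton` and the constant `SHOOT_DB_SUBDIRS` are made up for illustration and are not part of SHOOT.

```python
from pathlib import Path

# Sub-directories from the layout above, relative to the database root.
SHOOT_DB_SUBDIRS = [
    "Gene_Trees",
    "MultipleSequenceAlignments",
    "Orthogroup_Sequences",
    "WorkingDirectory/Alignments_ids",
]


def make_shoot_db_skeleton(root):
    """Create the empty directory layout (hypothetical helper, not a SHOOT API).

    Directories that already exist are left untouched.
    """
    root = Path(root)
    for sub in SHOOT_DB_SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `make_shoot_db_skeleton("ShootDB")` only creates the directories; the files described in the rest of this comment still have to be copied in and renamed.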

All your clusters need to be renamed using the OrthoFinder nomenclature scheme: "OG" + 7 digits, starting from 0 for the first cluster and incrementing for the following ones. The numbers must match the OrthoMCL output. This means that, if you use a different naming scheme, you will need a cluster name lookup table to rename your clusters and match OG names against your own cluster names. You will also have to adjust those names in several places (file contents).
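
The lookup table mentioned above could be built along these lines. This is a sketch under assumptions: the helper names `og_name` and `build_og_lookup` are invented here, the cluster names in the usage note are placeholders, and the input list is assumed to be in the same order as the clusters in the clustering output, so that index 0 is the first cluster.

```python
def og_name(index):
    """Return "OG" + 7 zero-padded digits for a 0-based cluster index."""
    if not 0 <= index <= 9_999_999:
        raise ValueError(f"cluster index {index} does not fit in 7 digits")
    return f"OG{index:07d}"


def build_og_lookup(cluster_names):
    """Map your own cluster names to OG names (hypothetical helper).

    Assumption: cluster_names is ordered exactly like the clusters in the
    clustering output, so that position i becomes OG number i.
    """
    lookup = {}
    for index, name in enumerate(cluster_names):
        if name in lookup:
            raise ValueError(f"duplicate cluster name: {name}")
        lookup[name] = og_name(index)
    return lookup
```

For example, `build_og_lookup(["famA", "famB"])` maps the placeholder name `famA` to `OG0000000` and `famB` to `OG0000001`; the same table can then be used to rename tree, alignment and sequence files consistently.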

  • Gene_Trees should contain all your phylogenetic trees in Newick format (NHX is not supported). The name scheme is "OG name" + "_tree.txt".

  • MultipleSequenceAlignments should contain all the alignments in FASTA format. The name scheme is "OG name" + ".fa".

  • Orthogroup_Sequences should contain all the cluster sequences in FASTA format. The name scheme is "OG name" + ".fa".
    Note: every sequence of a family must be present in all three files: the cluster FASTA, its corresponding alignment and its corresponding tree.

  • WorkingDirectory should contain a set of files:

    • SpeciesIDs.txt: lists the species FASTA files in your dataset, one per line, following the format: <species number (starting from 0)>: <species name>.faa. Ex.: 0: arath.faa, 1: orysa.faa
    • SequenceIDs.txt: lists all the sequences in your dataset, one per line, following the format: <species number>_<species sequence number>: <sequence name>. Ex.: 1_0: LOC_Os01g01050.1
    • Species<species number>.fa: where "species number" corresponds to the species number given in the SpeciesIDs.txt file. It contains the sequences of that species in FASTA format. Sequence names should use the format <species number>_<species sequence number>, as described in SequenceIDs.txt.
    • clusters_OrthoFinder_I1.5.txt_id_pairs.txt: the OrthoMCL output matrix file ("out..I15" or ".I"). Cluster indices must correspond to OG numbers. Sequence names (listed for each cluster) must follow the <species number>_<species sequence number> nomenclature specified in SequenceIDs.txt. The matrix must contain exactly your clusters, no more, no less (i.e. if you discarded some clusters for your alignments and trees, they need to be removed from the matrix as well!).
    • Alignments_ids: contains the same files as MultipleSequenceAlignments but with sequence names using the SequenceIDs.txt nomenclature (<species number>_<species sequence number>).
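
The ID nomenclature used by the WorkingDirectory files can be generated in one pass. The following is a sketch, not SHOOT or OrthoFinder code: `build_id_tables` and `rename_fasta_headers` are hypothetical helpers, sequence names are assumed to be unique across species, and the sequence name is assumed to be the first whitespace-delimited token of each FASTA header.

```python
def build_id_tables(species_to_sequences):
    """Build the lines of SpeciesIDs.txt and SequenceIDs.txt.

    species_to_sequences: dict mapping species name -> list of sequence names.
    The dict order defines the species numbers and the list order defines the
    per-species sequence numbers (both start at 0).
    """
    species_lines, sequence_lines, name_to_id = [], [], {}
    for sp_idx, (species, seq_names) in enumerate(species_to_sequences.items()):
        species_lines.append(f"{sp_idx}: {species}.faa")
        for seq_idx, seq_name in enumerate(seq_names):
            seq_id = f"{sp_idx}_{seq_idx}"
            sequence_lines.append(f"{seq_id}: {seq_name}")
            name_to_id[seq_name] = seq_id
    return species_lines, sequence_lines, name_to_id


def rename_fasta_headers(fasta_text, name_to_id):
    """Replace each FASTA header with its <species>_<sequence> ID.

    Raises KeyError if a header names a sequence missing from name_to_id.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            original_name = line[1:].split()[0]
            out.append(">" + name_to_id[original_name])
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The two lists can be written to SpeciesIDs.txt and SequenceIDs.txt, one entry per line. `rename_fasta_headers` can then be applied to each species FASTA to produce Species<species number>.fa, and to each alignment to produce its Alignments_ids counterpart.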

Then SHOOT can be used to initialize the "SHOOT database" with the following commands:

python shoot/create_shoot_db.py <your "ShootDB" path> full
python shoot/create_shoot_db.py <your "ShootDB" path> profiles
python shoot/bifurcating_trees.py <your "ShootDB" path>
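
If the three steps need to be scripted, a thin wrapper such as the one below could run them in order and stop at the first failure. It is only a sketch: `shoot_init_commands` and `init_shoot_db` are hypothetical names, the script paths are taken from the commands above, and it assumes the wrapper is run from the SHOOT repository root with a Python interpreter that has SHOOT's dependencies installed.

```python
import subprocess
import sys


def shoot_init_commands(db_path, shoot_dir="shoot"):
    """Return the three initialization commands as argument lists.

    Assumption: shoot_dir is the directory holding the SHOOT scripts,
    relative to the current working directory.
    """
    create_db = f"{shoot_dir}/create_shoot_db.py"
    return [
        [sys.executable, create_db, db_path, "full"],
        [sys.executable, create_db, db_path, "profiles"],
        [sys.executable, f"{shoot_dir}/bifurcating_trees.py", db_path],
    ]


def init_shoot_db(db_path, shoot_dir="shoot"):
    """Run the commands in order; check=True aborts on the first error."""
    for cmd in shoot_init_commands(db_path, shoot_dir):
        subprocess.run(cmd, check=True)
```

Keeping the command list separate from the runner makes it easy to print the commands for a dry run before executing `init_shoot_db("ShootDB")`.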
