Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different results in command line and web #1684

Closed
LalicJ opened this issue Jan 26, 2024 · 9 comments
Closed

Different results in command line and web #1684

LalicJ opened this issue Jan 26, 2024 · 9 comments

Comments

@LalicJ
Copy link

LalicJ commented Jan 26, 2024

Hi, I met different results when I run busted analysis with command line and web(datamonkey.org).
My command line has the same parameters with web as follow : hyphy busted --alignment pal2nal_gapout_NHP_AAV.fasta --tree AAVrh85.treefile --branches Foreground --kill-zero-lengths No --error-sink No CPU=100 > hyphy_busted_85_new.out.
The analysis logs of the command and web versions are as follows:
hyphy_busted_85_new.txt
log.txt
The p value is less than 0.05 in the web version, but not in the command version. The only difference I can think of is that in the command version I entered a tree that I built myself. Do you have any suggestions? I'm really racking my brain to know why this is. Any help would be greatly appreciated. :)@spond

@spond
Copy link
Member

spond commented Jan 26, 2024

Dear @LalicJ,

This is quite unusual, since the log likelihood is different for all models including the baseline GTR. This leads me to hypothesize that the Datamonkey tree is different. Datamonkey might be using an NJ tree. This is also what your comment seems to indicate.

Using different trees can most decidedly affect the result of a selection analysis; since you are only using a single foreground branch, the analysis could be quite unstable (small sample size) I pulled your alignment file from Datamonkey and ran BUSTED locally, using a NJ tree, and using an ML tree built by raxml-ng.

The branch you are testing is rather short (~0.001, so effectively zero). For such a short branch there is no power to detect anything at all, so the expected result is no selection

ML tree

image

NJ tree (quite different!)

image

Because the alignment is small, you can also use 2 rates (especially since in your original runs, you were getting a lot of "Collapsed Rate" notices).

ML TREE

hyphy busted --alignment msa.fas --tree ml.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST

* For *test* branches, the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.996     |    0.000    |       Not supported by data       |
|      Diversifying selection       |     6.998     |   100.000   |                                   |


...

## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.4509**.

NJ TREE

hyphy busted --alignment msa.fas --tree nj.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST

* For *test* branches, the following rate distribution for branch-site combinations was inferred
|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.934     |   67.411    |                                   |
|      Diversifying selection       |     1.116     |   32.589    |                                   |


....
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.5000**.

OK, so what happened with the Datamonkey result? Looking at the result page, all of the signal comes from a single codon, and a single substitution (V → I)

image image

However, if all you do is change the tree a little bit (like this NJ tree which HyPhy builds if you provide --tree neighbor-joining option), the substitution is now on the ancestral branch and does not contribute to the evolution on the focal lineage.

image

This highlights the inherent fragility of these "single-branch" analyses.

HTH,
Sergei

@LalicJ
Copy link
Author

LalicJ commented Jan 29, 2024

Thank you very much for your answer!!! The data set I used above is just an attempt to replicate the results of someone else's hyphy busted analysis. If, as you say, I actually have a very large data set (1000 + sequences but very close evolutionary distance), what is the recommended method to build the tree, and do the --rate and --starting-points parameters need to be changed? (Same to detect "single-branch")

@spond
Copy link
Member

spond commented Jan 29, 2024

Dear @LalicJ,

Just to confirm : are you looking to detect selection on a single (predefined) branch using a large alignment of closely related species?

Best,
Sergei

@LalicJ
Copy link
Author

LalicJ commented Jan 30, 2024

Actually, I want to detect selection on a single (predefined) branch using a large alignment of closely related virus sequences. As a rule of thumb, a virus sequence with a positive selection signal is likely to perform better in some ways.

@spond
Copy link
Member

spond commented Jan 30, 2024

Dear @LalicJ,

Great. If you have a single predefined branch, then I would suggest the following.

  1. Be aware of the length of the focal branch. If it's too short (e.g. ~0.01 subs/site), then all the inference will be based on as few a single substitution. This will typically result in a loss of power.
  2. Using 100+ or 1000+ closely related sequences to study the evolution of a single branch is likely not efficient (computationally or statistically). Because of the Markovian nature of the substitution process, the further away another sequence is from the sequence of interest, the less influence the former will have on the latter.

For example, if you wish to study the evolution of the highlighted BLUE branch in this (466 sequences) IAV tree (this is an example file https://github.com/veg/hyphy/blob/master/tests/data/IAV-human-H1N1-HA.nex), then the relatively distant sequences from a different clade (red box) are unlikely to have any noticeable impact on what the ω estimate on the blue branch is.

image

I would suggest trimming the analysis to just the neighborhood of the node, and the path from the node to the root.

Best,
Sergei

@LalicJ
Copy link
Author

LalicJ commented Jan 31, 2024

Thank you! I got it! I had also been struggling with the long detection time of large data sets. According to your suggestion, I will divide the data set and rebuild the tree for analysis. Thanks again for your help:).

@spond
Copy link
Member

spond commented Jan 31, 2024

Dear @LalicJ,

Please let me know how it goes. I have found treemer (https://github.com/fmenardo/Treemmer) to be quite useful for tree trimming. Once you've trimmed the tree, you can use HyPhy scripts from https://github.com/veg/hyphy-analyses/tree/master/LabelTrees (trim-label-tree.bf) to make sure the alignment and the tree have the same set of sequences and to label the nodes you want labelled.

Best,
Sergei

@LalicJ
Copy link
Author

LalicJ commented Feb 18, 2024

Thanks for your advice, I will use treemer to re-analyze later. My current approach is to regroup the sequences to build multiple smaller trees and no problems found at the moment.

Copy link

Stale issue message

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants