Different results in command line and web #1684

LalicJ · 2024-01-26T06:08:31Z

Hi, I met different results when I run busted analysis with command line and web(datamonkey.org).
My command line has the same parameters with web as follow : hyphy busted --alignment pal2nal_gapout_NHP_AAV.fasta --tree AAVrh85.treefile --branches Foreground --kill-zero-lengths No --error-sink No CPU=100 > hyphy_busted_85_new.out.
The analysis logs of the command and web versions are as follows:
hyphy_busted_85_new.txt
log.txt
The p value is less than 0.05 in the web version, but not in the command version. The only difference I can think of is that in the command version I entered a tree that I built myself. Do you have any suggestions? I'm really racking my brain to know why this is. Any help would be greatly appreciated. :)@spond

The text was updated successfully, but these errors were encountered:

spond · 2024-01-26T21:43:05Z

Dear @LalicJ,

This is quite unusual, since the log likelihood is different for all models including the baseline GTR. This leads me to hypothesize that the Datamonkey tree is different. Datamonkey might be using an NJ tree. This is also what your comment seems to indicate.

Using different trees can most decidedly affect the result of a selection analysis; since you are only using a single foreground branch, the analysis could be quite unstable (small sample size) I pulled your alignment file from Datamonkey and ran BUSTED locally, using a NJ tree, and using an ML tree built by raxml-ng.

The branch you are testing is rather short (~0.001, so effectively zero). For such a short branch there is no power to detect anything at all, so the expected result is no selection

ML tree

NJ tree (quite different!)

Because the alignment is small, you can also use 2 rates (especially since in your original runs, you were getting a lot of "Collapsed Rate" notices).

ML TREE

hyphy busted --alignment msa.fas --tree ml.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST

* For *test* branches, the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.996     |    0.000    |       Not supported by data       |
|      Diversifying selection       |     6.998     |   100.000   |                                   |


...

## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.4509**.

NJ TREE

hyphy busted --alignment msa.fas --tree nj.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST

* For *test* branches, the following rate distribution for branch-site combinations was inferred
|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.934     |   67.411    |                                   |
|      Diversifying selection       |     1.116     |   32.589    |                                   |


....
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.5000**.

OK, so what happened with the Datamonkey result? Looking at the result page, all of the signal comes from a single codon, and a single substitution (V → I)

However, if all you do is change the tree a little bit (like this NJ tree which HyPhy builds if you provide --tree neighbor-joining option), the substitution is now on the ancestral branch and does not contribute to the evolution on the focal lineage.

This highlights the inherent fragility of these "single-branch" analyses.

HTH,
Sergei

LalicJ · 2024-01-29T01:39:25Z

Thank you very much for your answer!!! The data set I used above is just an attempt to replicate the results of someone else's hyphy busted analysis. If, as you say, I actually have a very large data set (1000 + sequences but very close evolutionary distance), what is the recommended method to build the tree, and do the --rate and --starting-points parameters need to be changed? (Same to detect "single-branch")

spond · 2024-01-29T14:59:44Z

Dear @LalicJ,

Just to confirm : are you looking to detect selection on a single (predefined) branch using a large alignment of closely related species?

Best,
Sergei

LalicJ · 2024-01-30T00:28:13Z

Actually, I want to detect selection on a single (predefined) branch using a large alignment of closely related virus sequences. As a rule of thumb, a virus sequence with a positive selection signal is likely to perform better in some ways.

spond · 2024-01-30T14:02:10Z

Dear @LalicJ,

Great. If you have a single predefined branch, then I would suggest the following.

Be aware of the length of the focal branch. If it's too short (e.g. ~0.01 subs/site), then all the inference will be based on as few a single substitution. This will typically result in a loss of power.
Using 100+ or 1000+ closely related sequences to study the evolution of a single branch is likely not efficient (computationally or statistically). Because of the Markovian nature of the substitution process, the further away another sequence is from the sequence of interest, the less influence the former will have on the latter.

For example, if you wish to study the evolution of the highlighted BLUE branch in this (466 sequences) IAV tree (this is an example file https://github.com/veg/hyphy/blob/master/tests/data/IAV-human-H1N1-HA.nex), then the relatively distant sequences from a different clade (red box) are unlikely to have any noticeable impact on what the ω estimate on the blue branch is.

I would suggest trimming the analysis to just the neighborhood of the node, and the path from the node to the root.

Best,
Sergei

LalicJ · 2024-01-31T00:18:59Z

Thank you! I got it! I had also been struggling with the long detection time of large data sets. According to your suggestion, I will divide the data set and rebuild the tree for analysis. Thanks again for your help:).

spond · 2024-01-31T13:35:46Z

Dear @LalicJ,

Please let me know how it goes. I have found treemer (https://github.com/fmenardo/Treemmer) to be quite useful for tree trimming. Once you've trimmed the tree, you can use HyPhy scripts from https://github.com/veg/hyphy-analyses/tree/master/LabelTrees (trim-label-tree.bf) to make sure the alignment and the tree have the same set of sequences and to label the nodes you want labelled.

Best,
Sergei

LalicJ · 2024-02-18T03:52:38Z

Thanks for your advice, I will use treemer to re-analyze later. My current approach is to regroup the sequences to build multiple smaller trees and no problems found at the moment.

github-actions · 2024-04-19T00:15:06Z

Stale issue message

github-actions bot added the no-issue-activity label Apr 19, 2024

github-actions bot closed this as completed Apr 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different results in command line and web #1684

Different results in command line and web #1684

LalicJ commented Jan 26, 2024

spond commented Jan 26, 2024

LalicJ commented Jan 29, 2024

spond commented Jan 29, 2024

LalicJ commented Jan 30, 2024 •

edited

spond commented Jan 30, 2024

LalicJ commented Jan 31, 2024

spond commented Jan 31, 2024

LalicJ commented Feb 18, 2024

github-actions bot commented Apr 19, 2024

Different results in command line and web #1684

Different results in command line and web #1684

Comments

LalicJ commented Jan 26, 2024

spond commented Jan 26, 2024

LalicJ commented Jan 29, 2024

spond commented Jan 29, 2024

LalicJ commented Jan 30, 2024 • edited

spond commented Jan 30, 2024

LalicJ commented Jan 31, 2024

spond commented Jan 31, 2024

LalicJ commented Feb 18, 2024

github-actions bot commented Apr 19, 2024

LalicJ commented Jan 30, 2024 •

edited