RELAX empty files in results, and results differing when re-run #1685

Closed
lmnano opened this issue Jan 30, 2024 · 7 comments

@lmnano

lmnano commented Jan 30, 2024

Hi

We're currently using RELAX in analyses across 16-32 vertebrate species and I noticed some issues with the results.
First for a bit of context, we're running RELAX on multiple gene alignments. It is run using a bash script that takes all the individual gene alignments in a specific folder and runs RELAX for each alignment in a for loop. There are two problems occurring in the results:

  1. A few result files come up empty, if I later run the same alignment again separately I do get results.
  2. I then re-ran a few alignments that had produced results, to check that everything was consistent. It was not: some results differ significantly. Further testing showed that results from later attempts (a single command, or re-running the loop with less data) were consistent with each other but differed from the original result.
    An example of differing test results (note the relaxation parameter):

Original run:

 "test results":{
   "LRT":4.340733070679562,
   "p-value":0.03721089856806759,
   "relaxation or intensification parameter":2.14281733753773
  },

Re-run:

 "test results":{
   "LRT":4.917692945022281,
   "p-value":0.02658299211893045,
   "relaxation or intensification parameter":0.6830123849475735
  }
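
The two runs above point in opposite directions: both p-values are below 0.05, but K > 1 in the original run and K < 1 in the re-run. A minimal Python sketch of that classification follows. The key names are copied from the JSON excerpts above; the function name `classify_relax` and the 0.05 cutoff are assumptions made for illustration, not something RELAX prescribes.

```python
import json

def classify_relax(test_results, alpha=0.05):
    """Label a RELAX "test results" block as relaxed, intensified or none.

    alpha is an assumed significance cutoff chosen for this sketch.
    """
    k = test_results["relaxation or intensification parameter"]
    p = test_results["p-value"]
    if p >= alpha:
        return "none"
    # K > 1 is read as intensification, K < 1 as relaxation.
    return "intensified" if k > 1 else "relaxed"

# Values taken from the two runs quoted above.
original = json.loads(
    '{"LRT": 4.340733070679562, "p-value": 0.03721089856806759, '
    '"relaxation or intensification parameter": 2.14281733753773}'
)
rerun = json.loads(
    '{"LRT": 4.917692945022281, "p-value": 0.02658299211893045, '
    '"relaxation or intensification parameter": 0.6830123849475735}'
)

print(classify_relax(original), classify_relax(rerun))
```

With a full result file, the same function could be applied to the "test results" object after loading the file with `json.load`.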

I also completely re-ran (using the for loop) one of the shorter analyses, the only difference being the output folder. Some of the results were inconsistent: the second run produced no empty files, and some numbers also differed.

My questions are: how can we tell which results are the right ones, and why are we getting empty files and differing results?

@spond
Member

spond commented Jan 30, 2024

Dear @lmnano,

  1. Which HyPhy version are you using?

  2. Try adding ENV="TOLERATE_NUMERICAL_ERRORS=1;" as a command line argument to help resolve empty results. My guess is that some files will occasionally trigger internal numerical stability checks, which by default terminate the program with an error.

  3. For some datasets, RELAX can have convergence issues, i.e. be sensitive to starting conditions. Make sure you specify --starting-points S --grid-size N, where S is an integer (I suggest 5-10), and N is also an integer (I suggest 500-1000). This will ask RELAX to spend more time trying different starting points for the optimization and may improve your results.

  4. For the more recent RELAX versions, there will also be convergence diagnostics in the JSON files
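
For anyone scripting this, here is a rough Python sketch of how the options from points 2 and 3 might be assembled into one command per alignment. The `--starting-points`, `--grid-size` and `ENV=...` arguments come from the suggestions above; the `--alignment`, `--tree` and `--output` option names, the file names, and the helper `build_relax_command` are assumptions for illustration, and any branch-selection options your existing script already passes would need to be added.

```python
import shlex

def build_relax_command(alignment, tree, output, starting_points=5, grid_size=500):
    """Return the argument list for one RELAX run with the suggested options."""
    return [
        "hyphy", "relax",
        "--alignment", alignment,   # assumed option name
        "--tree", tree,             # assumed option name
        "--output", output,         # assumed option name
        "--starting-points", str(starting_points),
        "--grid-size", str(grid_size),
        # Passed as a single argument; no shell quoting is needed in a list.
        "ENV=TOLERATE_NUMERICAL_ERRORS=1;",
    ]

# Hypothetical file names, used only to show the resulting command line.
cmd = build_relax_command("gene1.fasta", "species.nwk", "gene1.RELAX.json")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd)` directly, which avoids quoting problems with the semicolon in the ENV argument.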

Convergence Checks and Reporting.

RELAX does two convergence checks.

1). Flat likelihood surface. After the alternative model is fitted, HyPhy will do a grid sample (varying K), and if the difference in log-likelihood is too small, then it will trigger the “sign” check:

If K was inferred to be > 1, the optimization will be redone forcing K ≤ 1 (and vice versa). If this results in a better LogL, the analysis will be labeled as unstable.

2). If a negative LRT (null vs full) is encountered, this will trigger a refit and a warning.

Console output will look like this

### Fitting the alternative model to test K != 1
* Log(L) = -17180.93, AIC-c = 34476.29 (57 estimated parameters)
* Relaxation/intensification parameter (K) =     0.20
* The following rate distribution was inferred for **test** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.320     |   60.683    |                                   |
|         Neutral evolution         |     1.000     |   39.120    |                                   |
|      Diversifying selection       |     2.080     |    0.196    |                                   |

* The following rate distribution was inferred for **reference** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.003     |   60.683    |                                   |
|         Neutral evolution         |     1.000     |   39.120    |                                   |
|      Diversifying selection       |    38.511     |    0.196    |                                   |


### * Potential convergence issues due to flat likelihood surfaces; checking to see whether K > 1 or K < 1 is robustly inferred

### Potential for highly unreliable K inference due to multiple local maxima in the likelihood function, treat results with caution 
> Relaxation parameter reset to opposite mode of evolution from that obtained in the initial optimization.
* Log(L) = -17176.84, AIC-c = 34468.12 (57 estimated parameters)
* Relaxation/intensification parameter (K) =     1.98
* The following rate distribution was inferred for **test** branches

In the JSON file, look for “convergence-*” keys in the analysis/settings object path.
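
A small Python sketch of that lookup, assuming the diagnostics sit under `analysis` then `settings` as described and that the relevant key names start with "convergence". The key name in the example dictionary is made up for the demonstration; check a real result file for the actual names.

```python
def convergence_flags(result, prefix="convergence"):
    """Return the entries under analysis/settings whose key starts with prefix."""
    settings = result.get("analysis", {}).get("settings", {})
    return {key: value for key, value in settings.items() if key.startswith(prefix)}

# Hypothetical structure and key name, for illustration only.
example = {
    "analysis": {
        "settings": {"convergence-example-flag": True, "other-setting": 1}
    }
}
print(convergence_flags(example))
```

For a real run, `result` would come from `json.load` on the RELAX output file; an empty dictionary means no such keys were found, which is also what an older RELAX version would give.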


Best,
Sergei

@lmnano
Author

lmnano commented Jan 31, 2024

Dear Sergei,

Thank you for a detailed answer.
The HyPhy version we're using is 2.5.51(MP)
I'll try your suggestions, and let you know if it worked ok. It might take a while to run the analysis again and get the results.

@spond
Member

spond commented Jan 31, 2024

Dear @lmnano,

Please let me know how it goes.

Best,
Sergei

@lmnano
Author

lmnano commented Feb 15, 2024

Dear Sergei,

This took a bit longer than expected. For now we're still getting empty files in the results. As for differing results, I need to find some time to check. I will let you know about that as well.

@spond
Member

spond commented Feb 15, 2024

Dear @lmnano,

Do you have any error/message log details for the runs that fail? How are you scheduling these jobs?

Best,
Sergei

@lmnano
Author

lmnano commented Feb 22, 2024

Dear Sergei,

It took me a while to get back to this.

First regarding empty files.

Currently we have an analysis that has between 16 and 22 vertebrate species in each alignment. The data is split by number of species in the alignments. In this analysis there are between 76 and 1623 alignments for each number of species.

RELAX is run on a Linux workstation in command line from a bash script that makes a list of all the alignments for a certain number of species in the analysis and runs RELAX for each one separately in a for loop. Each RELAX run is a consecutive instance of this for loop. There is no scheduler on the workstation, the script is run as a background job. The machine should be powerful enough to run this without any issues.

I have re-run RELAX for all of the results that came up empty and they produced results, although some had to be re-run twice. I have log files for each run, two different errors seem to occur. Since every log file covers all of the alignments in one run and is therefore very long, I copy-pasted only the part that is relevant to one of the alignments that produced an empty result and put it in the attachments, one for each kind of error:
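
For reference, this is roughly how the empty results can be picked out of an output folder so that only those alignments are re-run. It is a sketch under assumptions: the `*.json` pattern, the file names in the demonstration, and the helper name `find_failed_results` are hypothetical, and a file is treated as failed if it is empty or does not parse as JSON.

```python
import json
import tempfile
from pathlib import Path

def find_failed_results(result_dir, pattern="*.json"):
    """Return sorted names of result files that are empty or not valid JSON."""
    failed = []
    for path in sorted(Path(result_dir).glob(pattern)):
        try:
            text = path.read_text()
            if not text.strip():
                raise ValueError("empty file")
            json.loads(text)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            failed.append(path.name)
    return failed

# Self-contained demonstration with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "geneA.RELAX.json").write_text('{"test results": {}}')
    Path(tmp, "geneB.RELAX.json").write_text("")
    failed = find_failed_results(tmp)
    print(failed)
```

Writing one log file per alignment instead of one per loop would also make it easier to match an error message to the alignment that triggered it.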

error1.log
error2.log

Second the reproducibility issue.

I tested this more thoroughly by randomly selecting 100 alignments from one of the sets and re-running RELAX on them. One of them returned an empty file, which I discarded. I compared the rest with the previous results and the resulting selection type inferred from test results (relaxed, intensified or none) differed for about 15 % of the alignments. I used your suggested settings from a previous post in both runs.
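
The comparison can be sketched in Python as follows. The gene names and labels below are invented toy values, not the actual results, and `discordance` is a hypothetical helper; the labels are the three selection types mentioned above.

```python
def discordance(first_run, second_run):
    """Compare selection labels for alignments present in both runs.

    Returns the fraction of shared alignments whose label changed and
    the sorted names of those alignments.
    """
    shared = sorted(set(first_run) & set(second_run))
    changed = [name for name in shared if first_run[name] != second_run[name]]
    rate = len(changed) / len(shared) if shared else 0.0
    return rate, changed

# Toy data for illustration only.
run1 = {"geneA": "relaxed", "geneB": "none", "geneC": "intensified", "geneD": "none"}
run2 = {"geneA": "relaxed", "geneB": "none", "geneC": "relaxed", "geneD": "none"}

rate, changed = discordance(run1, run2)
print(f"{rate:.0%} discordant: {changed}")
```

Alignments missing from either run (for example, ones that returned an empty file) are left out of the comparison, matching how the empty result was discarded here.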

Also, is there a way to test for consistency or convergence of the tested parameters in HyPhy?


Stale issue message
