
Probabilities greater than 1 for empirical taxon sampling with dnEpisodicBirthDeath #382

Open
Jeremy-Andreoletti opened this issue Sep 15, 2023 · 12 comments


@Jeremy-Andreoletti
Contributor

Describe the bug
The code for empirical taxon sampling (clade-specific rho) with dnEpisodicBirthDeath throws an unexpected exception: "Problem in computing the probability of missing species in BDP."

To Reproduce
From the tutorial Diversification Rate Estimation with Missing Taxa, download mcmc_EBD.Rev and primates_tree.nex.

Then, follow the tutorial instructions in the "Empirical Taxon Sampling" section to modify mcmc_EBD.Rev (by the way, there is a typo in the tutorial: "Varecia_variegata_variegata" should be "Varecia_variegata"), or directly use mcmc_EBD_CSrho.Rev.txt.

Run mcmc_EBD_CSrho.Rev.

Expected behavior
The MCMC runs without errors.

Computer info
MacOS Ventura 13.5.2 (Apple M1 Pro)
RevBayes version 1.2.1; I also tried the development branch.

Additional context
The error arises in the computeLnProbabilityTimes function (in BirthDeathProcess.cpp).

for (size_t i=0; i<incomplete_clades.size(); ++i)
{
    // ...

    double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );
    double p_0_t = 1.0 - pSurvival(last_event_time,present_time,1.0) * exp( rateIntegral(last_event_time,present_time) );
    double log_F_t = log(p_0_t) - log(p_0_T);

    if ( log_F_t > 0.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

    // ...
}

I've observed that pSurvival values can sometimes be very low (down to ~1e-300) and exp(rateIntegral) very high (up to ~exp(600)), so I suppose this can lead to numerical instabilities when p_0_T and p_0_t are expected to be very close.

The calculations are based on "Inferring Speciation and Extinction Rates under Different Sampling Schemes" and "Likelihood Inference of Non-Constant Diversification Rates with Incomplete Taxon Sampling" (@hoehna).

@bredelings
Contributor

Thanks for the analysis. I am not familiar with this code, but I can see how it could be unstable. You could get an even better idea of what is going on by printing the values of the variables when something goes wrong. For example:

    throw RbException()<<"Problem in computing the probability of missing species in BDP.    log_F_t = "<<log_F_t;

@bredelings
Contributor

For the issue that you mentioned, one improvement could be to replace:

    double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );

with

    double p_0_T = 1.0 - exp( lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time) );

I think there is a bug in lnProbSurvival where it does return 0.0;, but it should do return log(0.0);

You would then make the equivalent change for p_0_t. I'm not sure if this will work or not -- ultimately lnProbSurvival( ) calls computeProbabilitySurvival(start, end) which is not on the log scale.

@bredelings
Contributor

You could then replace 1.0 - exp( thing ) with -expm1( thing ). That will help numerical stability when thing is very small.

@Jeremy-Andreoletti
Contributor Author

Thanks for the advice, I was optimistic that it might work but it doesn't seem to be enough.

double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));
double p_0_T = -expm1(lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time));
double log_F_t = log(p_0_t) - log(p_0_T);

if ( log_F_t > 0.0 )
{
        std::cout << std::setprecision(20)
                  << "\npSurvival(last_event_time,present_time,1.0) = " << pSurvival(last_event_time,present_time,1.0)
                  << "\nrateIntegral(last_event_time,present_time) = " << rateIntegral(last_event_time,present_time)
                  << "\npSurvival(0,present_time,1.0) = " << pSurvival(0,present_time,1.0)
                  << "\nrateIntegral(0,present_time) = " << rateIntegral(0,present_time)
                  << "\np_0_t = " << p_0_t
                  << "\np_0_T = " << p_0_T
                  << "\nlog_F_t = " << log_F_t
                  << "\npresent_time = " << present_time
                  << "\nlast_event_time = " << last_event_time;
        throw RbException("Problem in computing the probability of missing species in BDP.");
}

I get the following values:

pSurvival(last_event_time,present_time,1.0) = 3.1140446871580990324e-30
rateIntegral(last_event_time,present_time) = 67.725437439794902161
pSurvival(0,present_time,1.0) = 6.4499208327301177728e-118
rateIntegral(0,present_time) = 269.62478019024678133
p_0_t = 0.19442013067406008209
p_0_T = 0.19442013067402572069
log_F_t = 1.7674750552032492124e-13
present_time = 65.091685983700003248
last_event_time = 48.750356713000002173
Error: Problem in computing the probability of missing species in BDP.

@bredelings
Contributor

bredelings commented Sep 18, 2023

Hmm... so p_0_t and p_0_T are really close. That's good to know. It looks like p_0_T < p_0_t, but this is not supposed to happen. The differences are at the 13th decimal place, which is close to the limit on precision for doubles.

I'm guessing that

lnProbSurvival(0,present_time,1.0) = lnProbSurvival(0,last_event_time,1.0) + lnProbSurvival(last_event_time,present_time,1.0)
rateIntegral(0,present_time) = rateIntegral(0,last_event_time) + rateIntegral(last_event_time,present_time)

If that is true, then we could write something like:

double A = lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time);
double B = lnProbSurvival(0,last_event_time,1.0) + rateIntegral(0,last_event_time);
double p_0_t = -expm1(A);
double p_0_T = -expm1(A+B);
double log_F_t = log(p_0_t) - log(p_0_T);

It might then be possible to express log_F_t as a Taylor series in terms of B when B is really small.

Also, I suspect that we need to have A<0 and B<0.

What do you think? If that checks out, then I have an idea. If it does not check out, then I have a different idea.

@Jeremy-Andreoletti
Contributor Author

I tried the trick you suggest but the decomposition only seems to work for rateIntegral, not for lnProbSurvival. So -expm1(A+B) ends up being quite different from p_0_T.
Now, I'm curious to know what idea you have in mind for this alternative.

@bredelings
Contributor

bredelings commented Sep 19, 2023

Drat. I guess that means that the probability of surviving from last_event_time to present_time is not independent of the event of surviving from time 0 to last_event_time.

OK, so if that didn't work, one possibility is to truncate the difference between p_0_t and p_0_T to 0 when it is really small, since the difference could be numerical error. So, something like

    if ( log_F_t > 0.0 )
    {
       if ( log_F_t < 1.0e-11 )
       {
           p_0_T = p_0_t;
           log_F_t = 0;
       }
       else
           throw RbException("Problem in computing the probability of missing species in BDP.");
    }

That is obviously kind of a hack.

Really, someone like @hoehna should probably take a look at this.

@Jeremy-Andreoletti
Contributor Author

Yes, I've tried doing something like that; it runs a bit longer, but it always ends up with an error at some point. Sometimes p_0_t or p_0_T becomes so small that it underflows to 0, which disrupts the calculation for good, and you can get log_F_t > 0.1.

@bjoelle
Contributor

bjoelle commented Oct 3, 2023

Looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)], but then 5 lines later we add m*log(p_0_T) again.
So we don't in fact need p_0_T or log_F_t at all, do we? That may solve the issue, if it's just a problem with the intermediate calculation.

@Jeremy-Andreoletti
Contributor Author

Wow, that's unexpected but it seems to work, thanks Joëlle!
I was able to remove log_F_t altogether and get the same likelihood with the following code:

// now iterate over the vector of missing species per interval
for (size_t i=0; i<incomplete_clades.size(); ++i)
{
    // We use equation (5) of Hoehna et al.
    // "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
    double last_event_time = root_age - incomplete_clade_ages[i];
    
    double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));

    if ( p_0_t < 0.0 || p_0_t > 1.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

    // get an estimate of the actual number of taxa
    int m = incomplete_clades[i].getNumberMissingTaxa();
    
    // multiply the probability for the missing species
    ln_prob_times += m * log(p_0_t);
}

return ln_prob_times;

@bredelings
Contributor

bredelings commented Oct 6, 2023

looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)] but then 5 lines later we add again m*log(p_0_T)

I guess that is correct, although it took a while to see that. I guess (total_species - num_taxa) is equal to the sum of all the m's.

But I'm not sure that this is right:

    if ( p_0_t < 0.0 || p_0_t > 1.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

The original criterion throws an error if p_0_t > p_0_T. While it seems like you don't need p_0_T to compute the likelihood, an error here may mean that there is a bug in the code to compute both p_0_t and p_0_T.

        // We use equation (5) of Hoehna et al.
        // "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
        double last_event_time = root_age - incomplete_clade_ages[i];

Probably someone should read this paper to figure out what is going on here. It would be helpful to know the meaning of p_0_t and p_0_T, among other things.

@Jeremy-Andreoletti
Contributor Author

The formula appears in this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0084184&type=printable (eq:6), and is briefly explained at the end of this tutorial: https://revbayes.github.io/tutorials/divrate/sampling.html.

The quantity $F_t = 1 - \frac{1 - P(N(T)>0 \mid N(t)=1)\exp(r(t,T))}{1 - P(N(T)>0 \mid N(t_1)=1)\exp(r(t_1,T))}$ (which corresponds to 1 - F_t in the code, if I've understood correctly) represents the "cumulative probability density of the time of a speciation event", which justifies checking log_F_t > 0.0.

Nevertheless, I agree that this solution risks masking an error/approximation in the calculation of lnProbSurvival or rateIntegral.
