
Probabilities greater than 1 for empirical taxon sampling with dnEpisodicBirthDeath #382

Open
Jeremy-Andreoletti opened this issue Sep 15, 2023 · 12 comments


@Jeremy-Andreoletti
Contributor

Describe the bug
The code for empirical taxon sampling (clade-specific rho) with dnEpisodicBirthDeath throws an unexpected exception: "Problem in computing the probability of missing species in BDP."

To Reproduce
From the tutorial Diversification Rate Estimation with Missing Taxa, download mcmc_EBD.Rev and primates_tree.nex.

Then, follow the tutorial instructions in the "Empirical Taxon Sampling" section to modify mcmc_EBD.Rev (by the way, there is a typo in the tutorial: "Varecia_variegata_variegata" should be "Varecia_variegata"), or directly use mcmc_EBD_CSrho.Rev.txt.

Run mcmc_EBD_CSrho.Rev.

Expected behavior
The MCMC runs without errors.

Computer info
MacOS Ventura 13.5.2 (Apple M1 Pro)
RevBayes version 1.2.1; I also tried the development branch.

Additional context
The error arises in the computeLnProbabilityTimes function (in BirthDeathProcess.cpp).

for (size_t i=0; i<incomplete_clades.size(); ++i)
{
    // ...

    double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );
    double p_0_t = 1.0 - pSurvival(last_event_time,present_time,1.0) * exp( rateIntegral(last_event_time,present_time) );
    double log_F_t = log(p_0_t) - log(p_0_T);

    if ( log_F_t > 0.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

    // ...
}

I've observed that pSurvival values can sometimes be very low (down to ~1e-300) and exp(rateIntegral) very high (up to ~exp(600)), so I suppose this can lead to numerical instabilities when p_0_T and p_0_t are expected to be very close.

The calculations are based on "Inferring Speciation and Extinction Rates under Different Sampling Schemes" and "Likelihood Inference of Non-Constant Diversification Rates with Incomplete Taxon Sampling" (@hoehna).

@bredelings
Contributor

Thanks for the analysis. I am not familiar with this code, but I can see how it could be unstable. You could get an even better idea of what is going on by printing the values of the variables when something goes wrong. For example:

    throw RbException()<<"Problem in computing the probability of missing species in BDP.    log_F_t = "<<log_F_t;

@bredelings
Contributor

For the issue that you mentioned, one improvement could be to replace:

    double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );

with

    double p_0_T = 1.0 - exp( lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time) );

I think there is a bug in lnProbSurvival where it does return 0.0;, but it should do return log(0.0);

You would then make the equivalent change for p_0_t. I'm not sure if this will work or not -- ultimately lnProbSurvival( ) calls computeProbabilitySurvival(start, end) which is not on the log scale.

@bredelings
Contributor

You could then replace 1.0 - exp( thing ) with -expm1( thing ). That will help numerical stability when thing is very small.

@Jeremy-Andreoletti
Contributor Author

Thanks for the advice, I was optimistic that it might work but it doesn't seem to be enough.

double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));
double p_0_T = -expm1(lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time));
double log_F_t = log(p_0_t) - log(p_0_T);

if ( log_F_t > 0.0 )
{
        std::cout << std::setprecision(20)
                  << "\npSurvival(last_event_time,present_time,1.0) = " << pSurvival(last_event_time,present_time,1.0)
                  << "\nrateIntegral(last_event_time,present_time) = " << rateIntegral(last_event_time,present_time)
                  << "\npSurvival(0,present_time,1.0) = " << pSurvival(0,present_time,1.0)
                  << "\nrateIntegral(0,present_time) = " << rateIntegral(0,present_time)
                  << "\np_0_t = " << p_0_t
                  << "\np_0_T = " << p_0_T
                  << "\nlog_F_t = " << log_F_t
                  << "\npresent_time = " << present_time
                  << "\nlast_event_time = " << last_event_time;
        throw RbException("Problem in computing the probability of missing species in BDP.");
}

I get the following values:

pSurvival(last_event_time,present_time,1.0) = 3.1140446871580990324e-30
rateIntegral(last_event_time,present_time) = 67.725437439794902161
pSurvival(0,present_time,1.0) = 6.4499208327301177728e-118
rateIntegral(0,present_time) = 269.62478019024678133
p_0_t = 0.19442013067406008209
p_0_T = 0.19442013067402572069
log_F_t = 1.7674750552032492124e-13
present_time = 65.091685983700003248
last_event_time = 48.750356713000002173
Error: Problem in computing the probability of missing species in BDP.

@bredelings
Contributor

bredelings commented Sep 18, 2023

Hmm... so p_0_t and p_0_T are really close. That's good to know. It looks like p_0_T < p_0_t, but this is not supposed to happen. The differences are at the 13th decimal place, which is close to the limit on precision for doubles.

I'm guessing that

lnProbSurvival(0,present_time,1.0) = lnProbSurvival(0,last_event_time,1.0) + lnProbSurvival(last_event_time,present_time,1.0)
rateIntegral(0,present_time) = rateIntegral(0,last_event_time) + rateIntegral(last_event_time,present_time)

If that is true, then we could write something like:

double A = lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time);
double B = lnProbSurvival(0,last_event_time,1.0) + rateIntegral(0,last_event_time);
double p_0_t = -expm1(A);
double p_0_T = -expm1(A+B);
double log_F_t = log(p_0_t) - log(p_0_T);

It might then be possible to express log_F_t as a Taylor series in terms of B when B is really small.

Also, I suspect that we need to have A<0 and B<0.

What do you think? If that checks out, then I have an idea. If it does not check out, then I have a different idea.

@Jeremy-Andreoletti
Contributor Author

I tried the trick you suggest but the decomposition only seems to work for rateIntegral, not for lnProbSurvival. So -expm1(A+B) ends up being quite different from p_0_T.
Now, I'm curious to know what idea you have in mind for this alternative.

@bredelings
Contributor

bredelings commented Sep 19, 2023

Drat. I guess that means that the probability of surviving from last_event_time to present_time is not independent of the event of surviving from time 0 to last_event_time.

OK, so if that didn't work, one possibility is to truncate the difference between p_0_t and p_0_T to 0 when it is really small, since the difference could be numerical error. So, something like

    if ( log_F_t > 0.0 )
    {
       if ( log_F_t < 1.0e-11 )
       {
           p_0_T = p_0_t;
           log_F_t = 0;
       }
       else
           throw RbException("Problem in computing the probability of missing species in BDP.");
    }

That is obviously kind of a hack.

Really, someone like @hoehna should probably take a look at this.

@Jeremy-Andreoletti
Contributor Author

Yes, I've tried doing something like that; it runs a bit longer, but it always ends up with an error at some point. Sometimes p_0_t or p_0_T becomes so small that it underflows to 0, which disrupts the calculation for good, and you can get log_F_t > 0.1.

@bjoelle
Contributor

bjoelle commented Oct 3, 2023

Looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)], but then 5 lines later we add m*log(p_0_T) again.
So we don't in fact need p_0_T or log_F_t at all, do we? That may solve the issue, if it's just a problem with the intermediate calculation.

@Jeremy-Andreoletti
Contributor Author

Wow, that's unexpected but it seems to work, thanks Joëlle!
I was able to remove log_F_t altogether and get the same likelihood with the following code:

// now iterate over the vector of missing species per interval
for (size_t i=0; i<incomplete_clades.size(); ++i)
{
    // We use equation (5) of Hoehna et al.
    // "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
    double last_event_time = root_age - incomplete_clade_ages[i];
    
    double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));

    if ( p_0_t < 0.0 || p_0_t > 1.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

    // get an estimate of the actual number of taxa
    int m = incomplete_clades[i].getNumberMissingTaxa();
    
    // multiply the probability for the missing species
    ln_prob_times += m * log(p_0_t);
}

return ln_prob_times;

@bredelings
Contributor

bredelings commented Oct 6, 2023

looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)] but then 5 lines later we add again m*log(p_0_T)

I guess that is correct, although it took a while to see that. I guess (total_species - num_taxa) is equal to the sum of all the m's.

But I'm not sure that this is right:

    if ( p_0_t < 0.0 || p_0_t > 1.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

The original criterion throws an error if p_0_t > p_0_T. While it seems like you don't need p_0_T to compute the likelihood, an error here may mean that there is a bug in the code to compute both p_0_t and p_0_T.

        // We use equation (5) of Hoehna et al.
        // "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
        double last_event_time = root_age - incomplete_clade_ages[i];

Probably someone should read this paper to figure out what is going on here. It would be helpful to know the meaning of p_0_t and p_0_T, among other things.

@Jeremy-Andreoletti
Contributor Author

The formula appears in this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0084184&type=printable (eq:6), and is briefly explained at the end of this tutorial: https://revbayes.github.io/tutorials/divrate/sampling.html.

The quantity $F_t = 1 - \frac{1 - P(N(T)>0 \mid N(t)=1)\exp(r(t,T))}{1 - P(N(T)>0 \mid N(t_1)=1)\exp(r(t_1,T))}$ (which corresponds to 1 - F_t in the code, if I've understood correctly) represents the "cumulative probability density of the time of a speciation event", which justifies checking log_F_t > 0.0.

Nevertheless, I agree that this solution risks masking an error/approximation in the calculation of lnProbSurvival or rateIntegral.
