Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with pglmm Bayes=TRUE model syntax #73

Open
Flaiba opened this issue Jul 25, 2022 · 14 comments
Open

Problem with pglmm Bayes=TRUE model syntax #73

Flaiba opened this issue Jul 25, 2022 · 14 comments
Assignees

Comments

@Flaiba
Copy link

Flaiba commented Jul 25, 2022

hello phyr team.
I am interested in replicating the analysis carried out by Vincze et al. 2022 (https://doi.org/10.1038/s41586-021-04224-5), employing other explicatory variables using the library phyr since this package might allow better analysis. For my question, I used the data frame that these authors make available (https://github.com/OrsolyaVincze/VinczeEtal2021Nature/blob/main/SupplementaryData.xls).
The work carried out by Vincze et al. (2022) studied a simple measure of cancer mortality risk (CMR) in mammals in the relationship with variables like body size or lifespan. Addionatilly, it was necessary to incorporate phylogenetic information due to the lack of independence between the species analyzed.
The data frame contains:
Variable Response (CMR) = the ratio between the number of cancer-related deaths (Neoplasia) and the total number of individuals whose postmortem pathological records were entered in the database (knownDeaths).

Like a first approach, we think of the following model:
(cbind(Neoplasia, knownDeaths-Neoplasia) ~ log(BodyMass)+log(lifeexp), data = data, family = "binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE)

However, when trying to make this model, we find some inconveniences.
First of all, I would like to ask to confirm my suspicions: the phyr library does not allow models to be made without any random variable,? Since I did not find any library that allows me to run a binomial model for proportions considering the phylogenetic information and not including any random variable. I ask this because it is possible that a binomial model (for proportions) does not have overdispersion, so in this case for these data I would not need to use mixed models.
However, it is known that binomial distribution models tend to suffer from overdispersion, so we decided to incorporate a random effects variable (OLRE) to take into account overdispersion. In this way, we can use the library without any inconvenience, using the following model:

M1=pglmm(cbind(Neoplasia, knownDeaths-Neoplasia) ~ log(BodyMass)+log(lifeexp)+ (1|OLRE), data = data, family = "binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE)

However, this model shows a bad fit and overdispersion caused by an excess of 0.
M1_FIT
M1_Overdis_by zeros
M1_overdisp

So, as a next step, we decided to implement zero-inflated models that this library would allow us to incorporate, using the following model:
M2=pglmm(cbind(Neoplasia, knownDeaths-Neoplasia) ~ log(BodyMass)+log(lifeexp)+ (1|OLRE), data = data, family = "zeroinflated.binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE,bayes=TRUE,verbose=TRUE), but.....

image
Consider entering the response variable as the proportion (CMR), but according to inla I have to enter it as an interger, but my problem is that I have no way to define the weights or the Ntrial for that proportion.

I admit that I have problems with the syntax of my model in relation to the response variable.
Could someone help me with the syntax to be able to run the model I need?

Thank you very much for your time and help.
Nicolas, thanks, thanks, thankssssssss........
Attached the database and the script used

`library(ggplot2)
library(ape)
library(car)
library(phytools)
library(phylolm) # phyloglm
library(phyr)
library(DHARMa)
library(dplyr)

str(data)
data$OLRE=as.factor(data$OLRE)
data$Species=as.factor(data$Species)
data$order=as.factor(data$order)

Species-specific body mass

data$BodyMass <- (data$MaleMeanMass + data$FemaleMeanMass)/2

phylogeny

phy <- read.nexus("consensus_phylogeny.tre") # consensus vertlife tree
phy <- bind.tip(phy, "Cervus_canadensis", where = which(phy$tip.label=="Cervus_elaphus"),
edge.length=0.5, position = 0.5)
phy <- bind.tip(phy, "Gazella_marica", where = which(phy$tip.label=="Gazella_subgutturosa"),
edge.length=0.5, position = 0.5)

#firts model

M0=pglmm(cbind(Neoplasia, Nodeadbycancer) ~ log(BodyMass), data = data, family = 'binomial',cov_ranef = list(sp = phy))

No random terms specified, use lm or glm instead, There is a possibility that running a model without random effects?

M1=pglmm(cbind(Neoplasia, knownDeaths-Neoplasia) ~ log(BodyMass)+log(lifeexp)+ (1|OLRE), data = data, family = "binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE)
summary(M1)

simulationOut<- simulateResiduals(fittedModel = M1, n = 250,refit = FALSE)
plot(simulationOut) # bad fit
testZeroInflation(simulationOut) # overdispersion by zeros
testOverdispersion(simulationOut) # overdispersion by not zeros is good

M2=pglmm(cbind(Neoplasia, knownDeaths-Neoplasia) ~ log(BodyMass)+log(lifeexp)+ (1|OLRE), data = data, family = " zeroinflated.binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE,bayes=TRUE)
`
dataIZ.txt

@rdinnager rdinnager self-assigned this Jul 26, 2022
@daijiang
Copy link
Owner

It seems that @rdinnager is working on this. @Flaiba if you don't need random terms, I think the phylolm package should work.

@rdinnager
Copy link
Collaborator

rdinnager commented Jul 27, 2022

Hi @Flaiba ,

Thanks for the detailed bug report. I believe the problem here is that your formula is not getting translated into an INLA formula properly, and it's because currently doing an operation inside cbind in a formula is not supported. I will work on fixing the bug (and also providing a way to pass Ntrials argument through to INLA directly). As a workaround you should be able to create a new column in your data.frame for the failed trials and then use that in the cbind, and it should work. So something like this:

data$fails <- data$knownDeaths - data$Neoplasia
M2 <- pglmm(cbind(Neoplasia, fails) ~ log(BodyMass) + log(lifeexp) + (1|OLRE), data = data, family = "zeroinflated.binomial", cov_ranef = list(sp = phy), add.obs.re = FALSE, bayes = TRUE)

Let me know if the above does not work.

@Flaiba
Copy link
Author

Flaiba commented Jul 27, 2022

Hi daijiang and rdinnager
Thanks for the reply.
As for the model with random effects, I may be wrong, but phylolm is not just for logistic regression and Poisson distribution. In other words, this package only can be used for the binomial distribution with a dichotomic response (success/fail), not for proportions like as (cbind(success, fails). Am I correct?

On the other hand, I think the problem is as you said rdinnager. Unfortunately, I used your code, but I still have the same problem.
When moving to INLA, there is a problem with the code concerning how to codify the variable response. Again appear the Error in the formula:

"Error in formula.character(object, env = baseenv()) :
invalid formula "Neoplasia": not a call "

The response variable is not detected. :(

Thank you both very much. Packages like this are essential to making better models and the possibility of asking them for help is an unbeatable chance to continue learning. Thanks.

@rdinnager
Copy link
Collaborator

Hi @Flaiba , Thanks for giving that a try. After the error occurs, could you run traceback() and paste the results here? It would also be helpful if you could make a minimal reproducible example. I can't run the code you pasted here because I don't have the tree file you used ("consensus_phylogeny.tre").

@Flaiba
Copy link
Author

Flaiba commented Jul 28, 2022

Hi! first of all, I apologize for not attaching the phylogenetic tree.
I copy the link to the phylogenetic tree.
https://github.com/OrsolyaVincze/VinczeEtal2021Nature/blob/d2f8e2dff8e64393ce3512f8b8490507fa9ae803/consensus_phylogeny.tre
These are the syntaxes I tried with the subsequent error message:

M2=pglmm(cbind(Neoplasia, knownDeaths-Neoplasia) ~ log(BodyMass)+log(lifeexp)+ (1|OLRE), data = data, family = "zeroinflated.binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE,bayes=TRUE,verbose=TRUE)
Error in formula.character(object, env = baseenv()) :
invalid formula "Neoplasia": not a call

data$fails <- data$knownDeaths - data$Neoplasia
M3 <- pglmm(cbind(Neoplasia, fails) ~ log(BodyMass) + log(lifeexp) + (1|OLRE), data = data, family = "zeroinflated.binomial", cov_ranef = list(sp = phy), add.obs.re = FALSE, bayes = TRUE)

Error in formula.character(object, env = baseenv()) :
invalid formula "Neoplasia": not a call

and Finally, I tried used the ratio of Neoplasia/KnownDeadths (CMR), but the problem is with the INLA

Error in inla.inlaprogram.has.crashed() :
The inla-program exited with an error. Unless you interupted it yourself, please rerun with verbose=TRUE and check the output carefully.
If this does not help, please contact the developers at help@r-inla.org.
Error in inla.core.safe(formula = formula, family = family, contrasts = contrasts, :
*** Fail to get good enough initial values. Maybe it is due to something else.

Thanks again, I hope that with the information from the phylogenetic tree you can replicate this problem, and finally get a solution.

daijiang added a commit that referenced this issue Aug 1, 2022
fixes #73, issue with translating formula
@daijiang daijiang reopened this Aug 1, 2022
@rdinnager
Copy link
Collaborator

Hi @Flaiba. I've now made a patch which I think should fix your issue. To try it just install the development version of phyr from github using remotes::install_github("daijiang/phyr"). Please let me know if you if this resolves your problem.

@Flaiba
Copy link
Author

Flaiba commented Aug 2, 2022

I am very grateful for your help. Both are very helpful. I can run this model perfectly.

M1=pglmm(cbind(Neoplasia, Nodeadbycancer) ~ LitterSize +log(BodyMass)+log(lifeexp)+(1|OLRE), data = datapaper, family = "zeroinflated.binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE,bayes = TRUE)

image

However, Can I use this space to ask two things related to this model? Firstly, I saw in this blog/help (https://daijiang.github.io/phyr/articles/phyr_example_empirical.html) for the package that the models can be checked using the Dharma package. In the case of this model, perhaps after this actualization, I can not check the model by Dharma.

simulationOut<- simulateResiduals(fittedModel = M1, plot = FALSE)
plot(simulationOut)

image
Maybe the problem is why the distribution of the model is zero-inflated? In this case, I probed with a binomial distribution "common" and finally, I saw this problem:

M2=pglmm(cbind(Neoplasia, Nodeadbycancer) ~ LitterSize +log(BodyMass)+log(lifeexp)+(1|OLRE), data = datapaper, family = "binomial",cov_ranef = list(sp = phy),add.obs.re = FALSE,bayes = TRUE)
summary(M17)
simulationOut<- simulateResiduals(fittedModel = M2, plot = FALSE)
plot(simulationOut)

image
Is it possible using Dharma to check these models?

Additionally, and I promise this is the last asked.
In my analysis, I have a variable that has 3 levels. Since this variable is significant, I need to do contrasts, for example, Tukey. So, what package can I use to do the contrasts that support objects of phyr package? Because the emmeans does not support this package

image

Thanks, thanks a lot for your time

@rdinnager
Copy link
Collaborator

Hmm.. I'm not surprised the zero inflated model didn't work with DHARMa, but the binomial one definitely should work. I'll look into that and get back to you. I will see if it is possible to get the zero inflated one working too. As for your final question, there are several possibilities. phyr doesn't support emmeans currently. For the Bayesian model, what I would do is extract the posterior distribution of the relevant coefficients to reconstruct the categorical means and plot them. It should be obvious which categories differ. If you have an a-priori hypothesis for which levels should be compared, there are two options:

  1. Reparameterise your model to make the comparison (that is, change how dummy variables are generated to make the right contrast(s)). This isn't always possible.
  2. Calculate the required contrasts from a sample from the posterior distribution, and then summarise across many posterior samples to get a distribution of the contrast values.

We intent to add more user friendly options for doing this sort of thing in the future, but at the moment it might require manually extracting useful quantities from the model object. Supporting emmeans would probably be a good way forward for supporting these use cases.

How did you determine that the categorical variable was 'significant'? Were you using the maximum likelihood or Bayesian methods for this?

@rdinnager
Copy link
Collaborator

In the mean-time, you may want to look into the emmeans::qdrg() function (https://rdrr.io/cran/emmeans/man/qdrg.html), and see if you can manually extract the useful quantities from the communityPGLMM object.

@Flaiba
Copy link
Author

Flaiba commented Aug 8, 2022

Sorry for the delay in responding, I had been away for the last few days. It will be great if the validation of models works employing Dharma. In the same way, if phyr objects can be used in other packages like emmeans, they can help users gravitate towards using these tools. However, despite the improvements that may be implemented in the future, I consider that this package is an excellent tool for making complex models, providing a better approximation regarding the options available so far. Once again, thanks for the development and help provided to end users.
I will be waiting for the news.
Finally, regarding how I detected "significant" differences between levels of a variable in a model: For the case of the model with bayes=FALSE, I saw that one of the three levels of this variable presented significant differences with the level in the intercept. On the other hand, for the model with the bayesian approach, the credibility interval for the same level that in the frequentist model did not include 0.
However, it is relevant for my question, regardless of the approach used (frequentist or Bayesian), that it be able to make the comparison between the levels observed in summary.

I will follow your suggestions.

@jenmunoz
Copy link

Hi,

I am following up on this issue, and I am wondering if the zero-inflated model can now be examined with DHARMa.

I am doing a PGLMM using the Bayesian approach with zero.inflated data and having two issues.

  1. The function plot_bayes is giving me an error ( but the error does not occur with Poisson or other families)
    "Error:! Column 4 must be named.
    Use .name_repair to specify repair.
    Caused by error in repaired_names():
    ! Names can't be empty.
    ✖ Empty name found at location 4.

  2. DHARMa
    Error in get(as.character(FUN), mode = "function", envir = envir) :
    object 'zeroinflatedpoisson1' of mode 'function' was not found

Thanks for your help ,

@rdinnager
Copy link
Collaborator

Hi @jenmunoz . Unfortunately I still haven't had a chance to look into getting the zero-inflated model working with DHARMa, but it would definitely be good to get this functionality. Thank you also for the bug report for plot_bayes() with the zero inflated model as well. I won't have time to look at this issue this week, but I can get to it next week.

To expedite the process, It would be very helpful if you were able to make a minimal reproducible example for this issue (perhaps with simulated data)?

@jenmunoz
Copy link

jenmunoz commented Mar 13, 2023

Hi @rdinnager,
The issue with zeroinflated.binomial and the plot_bayes() seems to arise because there is a Random effect that is generated which is not named. I assume this is a 1|obs Random effect. If I get that r.e named manually, then it is possible to plot it.

The issue with DHARMa not working seems to apply to all pglmm with bayes=TRUE. For instance, if I try to just create posterior predictive simulations and feed them to createDHARMa does not work because the function phyr::simulate.communityPGLMM is not working for bayes=TRUE models.

Thanks for your reply, here is a reproducible example.

DATA

data_df<-read.csv("7.df_data_example.csv", na.strings =c("","NA")) %>% na.omit()
phylo<-read.nexus("1_consensus_birdtree.nex")
unique(data_df$species_jetz )

STRUCTURE

ectos_df$species_jetz<-as.factor(ectos_df$species_jetz)
ectos_df$elevation<-as.numeric(ectos_df$elevation)
ectos_df$sociality<-as.factor(ectos_df$sociality)
ectos_df$Powder.lvl<-as.factor(ectos_df$Powder.lvl)

MODEL

pglmm_bayes <-phyr::pglmm(total_parasites~sociality+scale(elevation)+(1|species_jetz__)+(1|Powder.lvl),
data = data_df,
family ="zeroinflated.poisson", #POISSON
cov_ranef = list(species_jetz= phylo), #class phylo
bayes = TRUE,
verbose=FALSE,
prior = "inla.default") # consider using add.obs.re = T

ASSUMPTIONS CHECK ( )

simulationOutput<-DHARMa::simulateResiduals(fittedModel= pglmm_bayes , plot= TRUE, re.form = NULL ) #quantreg=T

#PLOT
summary(pglmm_bayes)
phyr::plot_bayes(pglmm_bayes)

Error in dplyr::as_tibble():
! Column 4 must be named.
Use .name_repair to specify repair.

Caused by error in repaired_names():
! Names can't be empty.
✖ Empty name found at location 4.

Create new data ( simulate)

simulate.communityPGLMM(pglmm_bayes, nsim = 10, seed = NULL, re.form = NULL)

Links to data
phylo tree:
https://github.com/jenmunoz/Chapter3_parasites/blob/857db3b15bfb9dfe9b18e0aa3c8ac2388404a209/reproducible_example/1_consensus_birdtree.nex

data_example:
https://github.com/jenmunoz/Chapter3_parasites/blob/857db3b15bfb9dfe9b18e0aa3c8ac2388404a209/reproducible_example/7.df_data_example.csv

@rdinnager
Copy link
Collaborator

Thank you for the example and the additional investigation of the bug. Much appreciated. I will try and get to a fix as soon as I can.

@rdinnager rdinnager mentioned this issue May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants