
Design how we're going to extend Bambi #544

Open · tomicapretto opened this issue Jul 11, 2022 · 21 comments
Labels: Discussion (Issue open for discussion, still not ready for a PR on it), enhancement

Comments

@tomicapretto (Collaborator)

The following is a list of features we're missing (or only partially covering) in Bambi:

  • Distributional models (modeling more than just the mean parameter of the response).
  • Multivariate models (i.e., the response follows a multivariate distribution).
  • Non-linear models.
  • Survival models / models with censored data.
  • Ordinal models.
  • Zero and zero-one inflated models.

The last three points (survival/censored, ordinal, and zero/zero-one inflated) are covered by the first two (distributional and multivariate) if we implement them appropriately. The third point, non-linear models, is a separate problem. I'll try to add a couple of things I've been thinking about lately.


Distributional models

Some API proposals

formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
priors = {
    "a": bmb.Prior("Normal", mu=0, sigma=1),
    "b": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_a": bmb.Prior("Normal", mu=0, sigma=1)
}
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors, link)
  • We need a formula object where we can have multiple formula parts. I propose to call it bmb.formula(). There's an open discussion in #423 (Add Bambi formula object).
  • We need a name for the terms associated with the auxiliary parameters. I propose using {param}_{term}, such as sigma_a.
  • We need a transformation of the linear predictor of the auxiliary parameters into something that makes sense. I propose we have defaults for the built-in families that can be overridden with a dictionary. Note that the link argument in Model does not currently accept a dictionary.

I haven't thought much more about the implementation details, where other concerns may appear. For the moment, I think it's good to discuss the API we want. Any objections, suggestions, or drawbacks I'm not seeing?
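To make the third bullet concrete, here is a minimal sketch of how per-family default links could be merged with a user-supplied dictionary; `DEFAULT_LINKS`, `INVLINKS`, and `resolve_links` are hypothetical names, not existing Bambi internals:

```python
import numpy as np

# Hypothetical defaults for a built-in family (illustrative only).
DEFAULT_LINKS = {"gaussian": {"mu": "identity", "sigma": "log"}}

# Inverse links map each parameter's linear predictor onto its support.
INVLINKS = {"identity": lambda eta: eta, "log": np.exp}

def resolve_links(family, overrides=None):
    """Merge user-supplied links over the family defaults."""
    links = dict(DEFAULT_LINKS[family])
    links.update(overrides or {})
    return links

links = resolve_links("gaussian")
eta_sigma = np.array([-1.0, 0.0, 1.0])       # linear predictor for sigma
sigma = INVLINKS[links["sigma"]](eta_sigma)  # exp keeps sigma positive
```

With this merge rule, passing a dictionary that overrides only one parameter leaves the remaining parameters at their family defaults.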

Multivariate models

We currently support some multivariate families, such as "categorical" and "multinomial", but I feel we should think more about the implementation. I think it could be more general so we don't need to treat every family as a special case. With that said, there are other things to discuss.

  • What do we use to indicate a multivariate response?
"c(y1, y2, ..., yn) ~ ..."
"mvbind(y1, y2, ..., yn) ~ ..."
bmb.formula("y1 ~ ...", "y2 ~ ...", "y3 ~ ...")

Note that the last alternative allows different predictors to be included for each response.

  • How much do we want to support multivariate families?

I'm not an expert in this area, but I have the feeling that things can get very complex very quickly. And I'm not sure this is a highly requested feature.

For now, I tend to think we should offer minimal support that allows us and our users to explore the available possibilities and refine the API.

Non-linear models

This has been discussed a little in #448. I think it's a very nice-to-have feature, but I haven't solved it in my mind yet. All I have are some API proposals, and I don't see how to implement them without a huge effort.

First:

formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    nlpars=("b1", "b2")    
)

But this comes with a major problem: how do we override the meaning of the * operator in the formula syntax? If we pass something like that to formulae, it won't multiply things by b1 or b2; it will try to construct full interaction terms between the operands. I like how this approach looks, but it would require a huge amount of effort to parse terms and parameters.

Another alternative would be to use a function.

def f(x, b1, b2):
    return b1 * np.exp(b2 * x)

formula = bmb.formula(
    "y ~ f(x, b1, b2)",
    nlpars=("b1", "b2")    
)

This would work on the formulae side, but again we would need to parse the function to recover the non-linear relationship between the parameters (b1 and b2) and the predictor x. How do we handle arbitrarily complex functions? I'm not sure.
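One small piece of the function-based proposal does come cheaply: Python's inspect module can separate a function's data arguments from its non-linear parameters without any string parsing, assuming nlpars is declared explicitly as above. `split_arguments` is a hypothetical helper, not existing Bambi API:

```python
import inspect
import numpy as np

def f(x, b1, b2):
    return b1 * np.exp(b2 * x)

def split_arguments(func, nlpars):
    """Split a non-linear function's arguments into data variables
    (for the design-matrix builder) and non-linear parameters."""
    names = list(inspect.signature(func).parameters)
    data_vars = [n for n in names if n not in nlpars]
    params = [n for n in names if n in nlpars]
    return data_vars, params

data_vars, params = split_arguments(f, nlpars=("b1", "b2"))
# data_vars == ["x"], params == ["b1", "b2"]
```

Recovering the functional form itself (for the PyMC graph) is the hard part this does not solve.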

Survival models/Models with censored data.

#543 adds support for survival analysis with right-censored data. One drawback of the proposal is that family="exponential" always implies right-censored data. I think we should have something more general.

I imagine all of the following cases working:

bmb.Model("y ~ ...", data, family="exponential")
bmb.Model("censored(y, status) ~ ...", data, family="exponential")
bmb.Model("censored(y, status, 'left') ~ ...", data, family="exponential")

The challenge is that censored() should be a function that returns an array-like structure (so formulae knows how to handle it) with some attribute that enables Bambi to figure out the characteristics of the censoring. I'm not sure how to implement this, but I know it's feasible.
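One hedged sketch of how censored() could satisfy both requirements: subclass np.ndarray so the object is array-like for the formula parser, and attach the censoring metadata as attributes. `CensoredArray` and `censored` are illustrative names, not Bambi API:

```python
import numpy as np

# Sketch only: an ndarray subclass that looks array-like to a formula
# parser while carrying censoring metadata Bambi could inspect later.
class CensoredArray(np.ndarray):
    def __new__(cls, values, status=None, side="right"):
        obj = np.asarray(values, dtype=float).view(cls)
        obj.status = None if status is None else np.asarray(status)
        obj.side = side
        return obj

    def __array_finalize__(self, obj):
        # Propagate metadata through slicing and views.
        self.status = getattr(obj, "status", None)
        self.side = getattr(obj, "side", "right")

def censored(values, status=None, side="right"):
    return CensoredArray(values, status=status, side=side)

y = censored([1.2, 3.4, 5.6], status=[1, 0, 1], side="right")
```

Because the result is a genuine ndarray, downstream code that only needs the values keeps working, while Bambi can check for the extra attributes to detect censoring and its side.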

Ordinal models and Zero and Zero-One inflated models.

I think these come almost for free if we do a good job with the tasks above.

tomicapretto added the Discussion and enhancement labels on Jul 11, 2022
@canyon289 (Collaborator)

IMO, given the effort involved, non-linear models should be the lowest priority.

@tomicapretto (Collaborator, Author)

IMO, given the effort involved, non-linear models should be the lowest priority.

I agree. It would be a lot of work without knowing whether it will pay off.

@aflaxman

IMO, given the effort involved, non-linear models should be the lowest priority.

I agree. It would be a lot of work without knowing whether it will pay off.

One route to nonlinear models (and perhaps some of the other items on your exciting list) would be to make a "transcompiler" similar to the request in #539 ; I could imagine using this in an exploratory research workflow to start with a linear model, generate python code for it, and then iteratively modifying the autogenerated code to include nonlinear elements.

@aloctavodia (Collaborator)

I wonder if defining the non-linear function in Aesara could help us get the necessary information in a general way (@ricardoV94, @rlouf).
For the record, I tried to solve #539 by inspecting the PyMC model, but then I abandoned that route and moved to using the structure information already available in Bambi (still a WIP).

@rlouf commented Jul 15, 2022

You can definitely analyze any aesara graph a user provides and see if it matches a certain pattern. Maybe if I knew bambi's internals better I could tell you whether it would be useful in this case.

@canyon289 (Collaborator)

Setting aside the technical implementation, frankly I don't think it'll be all that useful, and there isn't a huge user base for it. If people want non-linear models they can just use PyMC to code them up.

The other use cases, IMO, are much easier to implement in Bambi and will have a wider user base.

@don-jil commented Jul 21, 2022

I like your proposed API for censored data. Very clean.

@tomicapretto (Collaborator, Author) commented Oct 22, 2022

I have new ideas for distributional models

Instead of this

formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
priors = {
    "a": bmb.Prior("Normal", mu=0, sigma=1),
    "b": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_a": bmb.Prior("Normal", mu=0, sigma=1)
}
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors, link)

have this (notice the priors)

formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
priors = {
    "y": {
        "a": bmb.Prior("Normal", mu=0, sigma=1),
        "b": bmb.Prior("Normal", mu=0, sigma=1),
    },
    "sigma": {
        "Intercept": bmb.Prior("Normal", mu=0, sigma=1),
        "a": bmb.Prior("Normal", mu=0, sigma=1)
    }
}
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors, link)

It adds more structure and prevents us from having to parse strings to decide which response a prior corresponds to. Also, "_" is very common in variable names, so string parsing is highly likely to get it wrong.

This does open new questions though. For example

  • What names do we use for the PyMC parameters?
    • I think it makes sense to use sigma_Intercept and sigma_a.
    • The rule would be: if the response is a parameter, prepend its name; if not, do nothing.
  • These modified names will impact the InferenceData object, and thus we still need to perform some string manipulation.
    • This does not make me very happy. I'll keep thinking.
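To make the naming rule concrete, it can be sketched in a few lines; `flatten_priors` is a hypothetical helper and the string prior values are placeholders standing in for bmb.Prior objects:

```python
def flatten_priors(priors, response="y"):
    """Terms under the response keep their name; terms under an
    auxiliary parameter get the '{param}_{term}' prefix."""
    flat = {}
    for parent, terms in priors.items():
        for term, prior in terms.items():
            name = term if parent == response else f"{parent}_{term}"
            flat[name] = prior
    return flat

priors = {
    "y": {"a": "Normal(0, 1)", "b": "Normal(0, 1)"},
    "sigma": {"Intercept": "Normal(0, 1)", "a": "Normal(0, 1)"},
}
flatten_priors(priors)
# keys: 'a', 'b', 'sigma_Intercept', 'sigma_a'
```

Note the forward direction (nested spec to flat PyMC names) is trivial; the pain point raised above is the reverse mapping when reading names back out of the InferenceData object.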

@zwelitunyiswa commented Oct 22, 2022 via email

tomicapretto pinned this issue Nov 5, 2022
@tomwallis

Setting aside the technical implementation, frankly I don't think it'll be all that useful, and there isn't a huge user base for it. If people want non-linear models they can just use PyMC to code them up.

The other use cases, IMO, are much easier to implement in Bambi and will have a wider user base.

I would like to chime in on the nonlinear models front. At least for people in my research field this would be a very welcome feature. Computational cognitive models are often nonlinear (and sometimes distributional). Popular examples of the latter are things like drift diffusion models, but I most commonly work with univariate psychometric functions of the form

$$\Psi(x; m, w, \gamma, \lambda) = \gamma + (1 - \gamma - \lambda) S(x; m, w)$$

where $S$ is a sigmoidal function like a cumulative Gaussian, $m$ and $w$ are "threshold" and "width" parameters (location and scale), and $\gamma$ and $\lambda$ squish the function between lower and upper asymptotes within $[0, 1]$. $\Psi$ is the probability of a response (e.g. a correct response) as a function of a stimulus level $x$, usually through a binomial distribution. Note that the model is nonlinear because $\gamma$ and $\lambda$ can be free parameters; if both were fixed to 0, this would boil down to a logistic / probit regression. These models are also closely related to models from item response theory.
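For readers less familiar with this model class, the function above is easy to write down directly. The sketch below uses a logistic sigmoid for $S$ (a cumulative Gaussian is the other common choice) and, as a simplification, treats $w$ directly as the sigmoid's scale:

```python
import numpy as np

def psi(x, m, w, gamma, lam):
    """Psychometric function: gamma + (1 - gamma - lam) * S(x; m, w),
    with S taken here as a logistic sigmoid (threshold m, width w)."""
    s = 1.0 / (1.0 + np.exp(-(x - m) / w))
    return gamma + (1.0 - gamma - lam) * s

x = np.linspace(-3, 3, 61)
p = psi(x, m=0.0, w=1.0, gamma=0.05, lam=0.02)
# p is squeezed between the asymptotes gamma and 1 - lam;
# with gamma = lam = 0 this is exactly logistic regression.
```

In a hierarchical setting each of m, w, gamma, and lam could carry its own linear-model structure, which is precisely what a distributional-plus-nonlinear API would express.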

While it's true that one can build these models by hand in PyMC, the convenience of a wrapper package like bambi is being able to quickly build and test versions of the models with different LMM structures on the different nonlinear parameters, without handling all the sampler optimisation tricks and interpretation / plot wrappers every time. This is a great boon for fast and robust model development and comparison, as well as providing a relatively gentle tool for students to learn with.

I've previously used brms for this work, but being an almost-pure-python user nowadays I'd love to have something similar available. While I don't know the backend of Bambi, I do like the API for nonlinear models in brms. I'm not sure if these examples would help inspire design ideas.

formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    nlpars=("b1", "b2")    
)

But this comes with a major problem: how do we override the meaning of the * operator in the formula syntax? If we pass something like that to formulae, it won't multiply things by b1 or b2; it will try to construct full interaction terms between the operands.

The brms workaround for this is basically to have an nl = TRUE flag in the appropriate place (or here perhaps an if nlpars is not None) that changes the parsing logic (i.e., treats * as an ordinary multiplication operator). The (G)LMMs could then be built on top of each nonlinear parameter as an extra argument:

formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    "b1 ~ 1 + (1 | g1)",
    "b2 ~ 1 + (1 | g2/g1)",
    nlpars=("b1", "b2"),
)

Please let me know if this would be of further interest.

@tomicapretto (Collaborator, Author)

@tomwallis first of all thanks a lot for chiming in!

We're really close to having distributional models! I need some extra time to polish the details in #607 and then we're done :)

For the non-linear part, I still anticipate non-trivial problems. I'm not saying it's impossible, though. I'm already aware of the syntax for specifying non-linear parameters in brms, but I couldn't study the details of how they are handled internally. I suspect it's easier to manipulate the expression in R because of the language's meta-programming capabilities; in Python, we would need to build much more machinery to do the same.

For example, let's say we have the expression b1 * np.exp(b2 * x) + np.log(c * y). We need machinery that tells formulae that the part of the expression that goes into design-matrix creation is only x + y, while we also store a reference to the full expression so we can handle the priors and the PyMC computation. Doing this in a general way is not trivial.

But anyway, thanks a lot for your input, I'm really happy to see there are more people interested in this feature. I will be pinging you when I start working on it if you want :)

@tomicapretto (Collaborator, Author)

To get nonlinear terms working I think we need to do the following:

  1. Extract individual terms, treating only the sum operator + as a splitting point. DONE: https://gist.github.com/tomicapretto/b938614811774b8ab4d9127bdb92e138
  2. Detect which terms use non-linear parameters. TO DO
  3. Manipulate the expressions where non-linear parameters appear, so that only the 'variable'/'predictor' is passed to formulae while the full expression is kept somewhere for later use (e.g., when adding the non-linear term to the PyMC model). TO DO
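Step 1 can be done robustly with Python's ast module rather than raw string splitting, since a top-level + must not be confused with a + inside a function call or product. This is a sketch in the spirit of the linked gist, not its actual code (ast.unparse requires Python 3.9+):

```python
import ast  # requires Python >= 3.9 for ast.unparse

def split_terms(rhs):
    """Split a formula right-hand side into top-level terms joined
    by '+', leaving '+' inside calls or products untouched."""
    terms = []

    def walk(node):
        # Recurse only through top-level additions; anything else
        # (products, calls, names) is a complete term.
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            walk(node.left)
            walk(node.right)
        else:
            terms.append(ast.unparse(node))

    walk(ast.parse(rhs, mode="eval").body)
    return terms

split_terms("b1 * np.exp(b2 * x) + np.log(c * y)")
# ['b1 * np.exp(b2 * x)', 'np.log(c * y)']
```

Step 2 could then walk each term's subtree looking for Name nodes that match nlpars, which also yields the information step 3 needs.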

@tomwallis

Thanks @tomicapretto for your positive reply! Yes, please ping me back when you start to invest time.

@paul-buerkner

These look like nice features indeed! If you have any questions about brms' internal structure and procedures of representing some of the more advanced models, please let me know.

@tomicapretto (Collaborator, Author)

@paul-buerkner I would definitely love that!

How would you prefer to do it? I could...

  • Open issues
    • Here, tagging you
    • In the brms repo
  • Send an e-mail
  • Have a meeting

I think that for specific technical questions, an issue may be a good alternative (since it's also public). But if at some point the conversation turns to broader topics such as design choices, we could have a meeting.

What do you think?

@paul-buerkner

Opening issues here in the bambi repo and tagging me sounds good. And I agree we can have a meeting at some point if we feel it could be useful.

@tomicapretto (Collaborator, Author)

@paul-buerkner here's my first question :)

What's the difference between a "distributional model" and a GAMLSS (generalized additive model for location, scale, and shape)? I took the name "distributional models" from brms, but as far as I understand they are very close to GAMLSS. Perhaps the difference is the GAM part, and a distributional model doesn't necessarily imply GAMs? Or is there another, more significant difference?

Thanks!

@paul-buerkner commented Jan 5, 2023 via email

@tomicapretto (Collaborator, Author)

Regarding naming conventions for terms in distributional models, someone pointed out to me via e-mail that {param}__{term} may be a better alternative than {param}_{term}. I'm citing the arguments:

"It might sound negligible, but having it as a convention can be very powerful in the future.
As evidence, see what the scikit-learn team did with models' names and their parameters in Pipeline objects.
The problem is Python's convention of using single underscores inside variable names: there's then no way to tell where param ends and term starts. Using (the rarer) dunder mostly solves that."

My position for the moment is: "I understand the point, and while I don't have a strong opinion about the existing solution, I don't see it as a very large problem either."

I'm sharing the exchange here so anyone can share their thoughts.
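For what it's worth, the dunder proposal can be illustrated in a few lines; `split_param_term` is a hypothetical helper, not Bambi API:

```python
def split_param_term(name):
    """With a dunder separator, the first '__' unambiguously splits
    the parameter from the term, even when the term contains '_'."""
    param, sep, term = name.partition("__")
    if not sep:
        raise ValueError(f"{name!r} has no '__' separator")
    return param, term

split_param_term("sigma__x_centered")  # ('sigma', 'x_centered')
# With a single '_' separator, 'sigma_x_centered' cannot be split
# reliably because term names themselves often contain underscores.
```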


@tomicapretto (Collaborator, Author)

@paul-buerkner I have some extra questions, now about GAMs :)

Let's take the following example

library(brms)

pisa <- readr::read_csv("https://raw.githubusercontent.com/m-clark/generalized-additive-models/master/data/pisasci2006.csv")
code <- make_stancode(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)
data <- make_standata(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)

brm_gam <- brm(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)
mgcv_gam <- mgcv::gam(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)

Below, the summaries of brm_gam and mgcv_gam, respectively

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Overall ~ 1 + s(Income, bs = "cr") 
   Data: pisa (Number of observations: 54) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(sIncome_1)     5.13      3.14     1.40    13.52 1.01      482      598

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   470.69      4.15   462.28   478.63 1.00     4515     2753
sIncome_1     8.04      1.29     5.48    10.60 1.00     3259     2654

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    31.67      3.51    25.65    39.42 1.00     1185      624

 Family: gaussian 
Link function: identity 

Formula:
Overall ~ 1 + s(Income, bs = "cr")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  470.444      4.082   115.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df     F p-value    
s(Income) 6.895  7.741 16.67  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =    0.7   Deviance explained = 73.9%
GCV = 1053.7  Scale est. = 899.67    n = 54

The Stan code produced is

// generated with brms 2.18.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for splines
  int Ks;  // number of linear effects
  matrix[N, Ks] Xs;  // design matrix for the linear effects
  // data for spline s(Income, bs = "cr")
  int nb_1;  // number of bases
  int knots_1[nb_1];  // number of knots
  // basis function matrices
  matrix[N, knots_1[1]] Zs_1_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  vector[Ks] bs;  // spline coefficients
  // parameters for spline s(Income, bs = "cr")
  // standarized spline coefficients
  vector[knots_1[1]] zs_1_1;
  real<lower=0> sds_1_1;  // standard deviations of spline coefficients
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
  // actual spline coefficients
  vector[knots_1[1]] s_1_1;
  real lprior = 0;  // prior contributions to the log posterior
  // compute actual spline coefficients
  s_1_1 = sds_1_1 * zs_1_1;
  lprior += student_t_lpdf(Intercept | 3, 488, 50.4);
  lprior += student_t_lpdf(sds_1_1 | 3, 0, 50.4)
    - 1 * student_t_lccdf(0 | 3, 0, 50.4);
  lprior += student_t_lpdf(sigma | 3, 0, 50.4)
    - 1 * student_t_lccdf(0 | 3, 0, 50.4);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(zs_1_1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
}

I would like to check the following

  • Why does brms have a slope for sIncome_1 while mgcv does not?
    • I checked the generated Xs and it's a matrix of shape (54, 1). I thought it contained the values of Income, but it doesn't. I then tried dividing by their standard deviation, but that didn't match either. What is Xs here, and how is it computed?
  • Where can I check how the matrix Zs_1_1 is computed?
  • Is there anything special about how you determine the prior for sds_1_1?

Again, thanks a lot for brms and your help!
