
Add Tweedie distribution: likelihood, pdf and sampling methods #7310

Open
lorentzenchr opened this issue Feb 6, 2021 · 16 comments

Comments

@lorentzenchr
Contributor

lorentzenchr commented Feb 6, 2021

Is your feature request related to a problem? Please describe

I'd like to have the Tweedie distribution in order to:

  • calculate the likelihood
  • calculate the pdf
  • generate random numbers

The likelihood in particular is of interest for an MLE estimate of the power parameter of a Tweedie GLM, see #2858.

Describe the solution you'd like

The next release of scipy, v1.7.0, will have the special function wright_bessel, see merged PR scipy/scipy#11313. This function greatly simplifies the computation of the Tweedie pdf.

Additional context

In scipy/scipy#11291, it was decided not to have the distribution in scipy itself, only the needed special function.
See also the scipy mailing list https://mail.python.org/pipermail/scipy-dev/2020-March/024074.html.

@josef-pkt
Member

@lorentzenchr Do you already have some code for some of the distribution methods like logpdf, rvs, moments ?

I started to look at it again, but I don't remember enough and haven't looked at it in years.

@lorentzenchr
Contributor Author

@josef-pkt I don't. But now that wright_bessel is available, it shouldn't be too hard. The most important thing is to have a reliable source for the formulae.

The Exponential Dispersion Model, and therefore also the Tweedie family, has two different formulations: additive and reproductive (see Wikipedia). Which one should we choose? I usually prefer the reproductive one.

@lorentzenchr
Contributor Author

lorentzenchr commented Jul 3, 2021

In Series evaluation of Tweedie exponential dispersion model densities, Eq. (1)-(2):

pdf(y, mu, p, phi) = f(y, theta, phi) = a(y, phi) * exp((y * theta - kappa(theta)) / phi)
kappa = cumulant function
theta = function of the expectation mu and the power p
alpha = (2 - p) / (1 - p)
for 1 < p < 2:
a(y, phi) = 1/y * wright_bessel(-alpha, 0, (p - 1)**alpha / ((2 - p) * phi**(1 - alpha) * y**alpha))
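Putting the pieces together, the density could be sketched as below. This is only an illustration of the series representation above, not an existing statsmodels or scipy API; `tweedie_pdf` is a hypothetical name, and I use the standard theta = mu**(1-p)/(1-p), kappa = mu**(2-p)/(2-p) for the reproductive parameterization.

```python
import numpy as np
from scipy.special import wright_bessel  # requires scipy >= 1.7

def tweedie_pdf(y, mu, p, phi):
    """Sketch of the Tweedie density for 1 < p < 2 via the Wright function.
    Hypothetical helper, not statsmodels/scipy API."""
    y = np.asarray(y, dtype=float)
    theta = mu ** (1 - p) / (1 - p)   # canonical parameter theta(mu, p)
    kappa = mu ** (2 - p) / (2 - p)   # cumulant function kappa(theta)
    alpha = (2 - p) / (1 - p)         # note: alpha < 0 for 1 < p < 2
    with np.errstate(divide="ignore", invalid="ignore"):
        # argument of the Wright function Phi(-alpha, 0; x)
        x = (p - 1) ** alpha / ((2 - p) * phi ** (1 - alpha) * y ** alpha)
        dens = wright_bessel(-alpha, 0.0, x) / y * np.exp((y * theta - kappa) / phi)
    # mass point at zero: P(Y = 0) = exp(-kappa(theta) / phi)
    return np.where(y > 0, dens, np.exp(-kappa / phi))
```

For mu = 1, p = 1.5, phi = 1 this reduces to a compound Poisson-gamma density with Poisson rate 2; the continuous part plus the mass at zero should integrate to one.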

@josef-pkt
Member

AFAICS, we need the reproductive version for GLM.
Wikipedia: for fixed parameter sigma^2, ED(mu, sigma^2) is a natural exponential family.

statsmodels GLM is a one parameter linear exponential family (LEF), conditional on variance parameters.

When we have the full loglikelihood, then we can use the profile likelihood to estimate the variance parameters in a second step.

(For example we have several versions of negativebinomial in statsmodels.discrete, but only the NB2 version is a LEF and included in GLM.)
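The two-step/profile idea can be illustrated with a plain normal linear model, where sigma^2 plays the role that (p, phi) play for Tweedie: for each candidate value of the non-mean parameter, fit the mean and record the loglikelihood. All names below are illustrative; the Tweedie version would swap in the Tweedie loglike.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=200)

def profile_loglike(sigma2):
    # Step 1: fit the mean parameters with the variance parameter held fixed
    # (for the normal model this fit happens not to depend on sigma2).
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    n = len(y)
    # Step 2: evaluate the full loglikelihood at the profiled-out mean fit.
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

grid = np.linspace(0.05, 2.0, 200)
ll = np.array([profile_loglike(s) for s in grid])
sigma2_hat = grid[np.argmax(ll)]  # close to the true value 0.25
```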

@josef-pkt
Member

A reference with full MLE versus QMLE/M-estimation (which is what we currently have as standard GLM with IRLS or Newton; the loglike is not used):

Wagner Hugo Bonat & Célestin C. Kokonendji (2017): Flexible Tweedie
regression models for continuous data, Journal of Statistical Computation and Simulation, http://dx.doi.org/10.1080/00949655.2017.1318876

@josef-pkt
Member

josef-pkt commented Jul 4, 2021

scipy 1.7.0 was just released in June. I'll wait until Fall to update winpython.
So, I'm not working on this right now.

I found an old gist of mine with simulating gamma compound poisson for rvs, and parameter conversion between compound poisson and tweedie.
(I didn't look at the details again. When I started to look at the conversion on Wikipedia, https://en.wikipedia.org/wiki/Compound_Poisson_distribution#Compound_Poisson_Gamma_distribution , the parameter names differed from my notebook.)
https://gist.github.com/josef-pkt/b8354b64959ebaf8a7b51c49cecdb54c

The notebook was written for #2915.
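The simulation approach from the gist (compound Poisson-gamma sampling plus parameter conversion) can be sketched like this. `tweedie_rvs` is a hypothetical name, and the conversion follows the standard reproductive parameterization; it is a sketch, not the gist's exact code.

```python
import numpy as np

def tweedie_rvs(mu, p, phi, size, rng=None):
    """Sample Tweedie(mu, p, phi) for 1 < p < 2 via its compound
    Poisson-gamma representation. Hypothetical helper, not an existing API."""
    rng = np.random.default_rng(rng)
    # Parameter conversion, reproductive Tweedie -> compound Poisson-gamma:
    lam = mu ** (2 - p) / (phi * (2 - p))   # Poisson rate (number of jumps)
    shape = (2 - p) / (p - 1)               # gamma shape of each jump
    scale = phi * (p - 1) * mu ** (p - 1)   # gamma scale of each jump
    n = rng.poisson(lam, size=size)
    # A sum of n iid Gamma(shape, scale) is Gamma(n * shape, scale);
    # n == 0 gives an exact zero, the mass point of the distribution.
    return np.where(n > 0, rng.gamma(np.maximum(n, 1) * shape, scale), 0.0)
```

A quick sanity check: for mu = 1, p = 1.5, phi = 1 the sample mean should be close to 1 and the fraction of exact zeros close to exp(-2).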

@josef-pkt
Member

Another article; based on a quick look, it uses the deviance, i.e. the loglike without the normalizing constant, as the objective function.

Wei Qian, Yi Yang & Hui Zou (2016) Tweedie’s Compound Poisson Model
With Grouped Elastic Net, Journal of Computational and Graphical Statistics, 25:2, 606-625,
http://dx.doi.org/10.1080/10618600.2015.1005213

@lorentzenchr
Contributor Author

@josef-pkt Where would you like to place this? Under statsmodels.distributions in a separate file?

Which API should it follow? A class of its own that quacks (duck typing) almost like scipy.stats.rv_continuous?

@josef-pkt
Member

We need the loglikelihood in genmod families so it can be used with GLM, e.g. for profile likelihood for the non-mean parameters. That can then be used immediately with GLM.

The full distribution as scipy distribution like class would go in statsmodels.distributions, new module.

In some families we use different parameterization in the regression models than the standard distribution parameterization.
We use get_distribution method to get the scipy compatible distribution from a regression estimate, e.g. for predictive distribution of new endog y.

Because scipy doesn't have a base class for distributions that are mixed mass-points and continuous, our distribution version might not be able to inherit from rv_continuous.

Eventually, we will write a full MLE model, most likely in statsmodels.othermod, that estimates the mean and dispersion simultaneously (similar to the new beta regression).

@josef-pkt
Member

@lorentzenchr Thanks for working on this and getting the special functions into scipy.

@josef-pkt
Member

I haven't thought much about the class structure for the tweedie distribution yet.

I guess we can write a subclass of rv_continuous for the positive part of Tweedie (y > 0) that can reuse the generic methods in scipy.stats, and then mix it with the mass-point distribution.

Our distributions.discrete module has zero-inflated models that mix in some methods, like cdf, with the underlying distribution. But those can still inherit generic methods from rv_discrete.
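The mixing idea could look roughly like this: delegate the continuous part to a frozen scipy distribution and add a mass point at zero. The class name and API here are hypothetical, not statsmodels code, and a gamma stands in for the positive Tweedie part.

```python
import numpy as np
from scipy import stats

class ZeroMassMixture:
    """Sketch: a continuous distribution for y > 0 mixed with a mass point
    at zero. Hypothetical, not statsmodels API."""

    def __init__(self, p_zero, cont_dist):
        self.p_zero = p_zero   # P(Y = 0)
        self.cont = cont_dist  # frozen continuous distribution for y > 0

    def cdf(self, y):
        y = np.asarray(y, dtype=float)
        return np.where(y < 0, 0.0,
                        self.p_zero + (1 - self.p_zero) * self.cont.cdf(y))

    def rvs(self, size, rng=None):
        rng = np.random.default_rng(rng)
        is_zero = rng.random(size) < self.p_zero
        draws = self.cont.rvs(size=size, random_state=rng)
        return np.where(is_zero, 0.0, draws)

# e.g. a point mass at zero mixed with a gamma for the positive part
mix = ZeroMassMixture(0.3, stats.gamma(a=2.0, scale=1.0))
```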

@h-vetinari

This has been released in 0.14 through #8560, right?

@josef-pkt
Copy link
Member

@h-vetinari
yes

@diegodebrito

@lorentzenchr @josef-pkt

I am having some issues when fitting constrained models, and it is related to the discussion about scipy's wright_bessel function above. When fitting models with constraints, statsmodels calls the loglike function instead of using iteratively reweighted least squares. The problem I'm having is that wright_bessel returns inf for some values, even with power = 1.5. This makes the log-likelihood sum (the objective function in this case) -inf, and the minimizer gets stuck there.

@lorentzenchr Did you guys decide to keep the implementation from scratch instead of using wright_bessel in GLUM? If so, was this the reason?

@josef-pkt I can open a separate issue for this and potentially work on a solution. I haven't looked into GLUM's code because I'm unsure about license/permission.

@thequackdaddy Do you remember having any issues with exploding scaling when computing log-likelihood in the Tweedie package?

Thanks!
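For reference, the kind of overflow described above is easy to reproduce in isolation (the argument values below are illustrative, not taken from the reported model): for power = 1.5 the density involves Phi(1, 0; x), which grows roughly like exp(2*sqrt(x)), so moderately large arguments already overflow a double. Newer scipy versions (1.14+) add scipy.special.log_wright_bessel, which works on the log scale and would sidestep this, if upgrading is an option.

```python
import numpy as np
from scipy.special import wright_bessel

# Phi(1, 0; x) = sqrt(x) * I_1(2 * sqrt(x)) grows like exp(2 * sqrt(x)),
# so e.g. exp(2 * sqrt(1e8)) = exp(20000) overflows double precision.
small = wright_bessel(1.0, 0.0, 1e2)  # finite
big = wright_bessel(1.0, 0.0, 1e8)    # inf, which makes log(pdf) infinite
```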

@josef-pkt
Member

@diegodebrito Yes, please open a separate issue.

Provide a working example that shows the problem, so we can look into this.

Besides the underlying code (assuming that returning inf is approximately correct for the special function), possible reasons for numerical problems:

  • scaling problems that cause over- or underflow during optimization
  • corner cases: the fit approaches a corner where some functions go to inf or zero
  • plain numerical issues in the code: log(exp(x)) fails when exp overflows for larger x

Those are cases we ran into in other models. I have not experimented with Tweedie long enough to have an idea where it is fragile.
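The third failure mode above, log(exp(x)) overflowing, can be seen in two lines; the usual fix is to stay on the log scale throughout, e.g. with scipy's logsumexp:

```python
import numpy as np
from scipy.special import logsumexp

x = np.array([10.0, 800.0])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))  # exp(800) overflows to inf
stable = logsumexp(x)                  # 800 + log(1 + exp(-790)) == 800.0
```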

What kind of constraints on the parameters do you have? Linear restrictions?
There might be approaches to help in the estimation.
For example, if the unconstrained estimation has a valid result, then we could get a one-step approximation to the constrained parameters to get a better starting value.

@diegodebrito

Thanks for your quick answer, @josef-pkt.

I opened an issue and added as much info as I could: #9234
