
REF: get_distribution, return 1-d instead of column frozen distribution #8780

Merged (4 commits) on Apr 13, 2023

Conversation

josef-pkt
Member

@josef-pkt josef-pkt commented Apr 7, 2023

#8723

This also now warns on invalid __init__ kwargs in BetaModel; closes #8779

I changed get_distribution in discrete to return 1-d instead of 2-d.
I expected some unit tests to fail, but they didn't.
Some tests don't fail because they are written to handle both 1-d and one-column 2-d returns; they work around the current inconsistency in return shape.
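To make the shape difference concrete, here is a minimal sketch using scipy.stats.poisson directly as a stand-in for what get_distribution builds (an illustration, not statsmodels code):

```python
import numpy as np
from scipy import stats

mu = np.array([1.0, 2.0, 3.0])  # predicted means, one per observation

# 1-d convention (new behavior): frozen distribution with 1-d args
distr_1d = stats.poisson(mu)
# column convention (old behavior): frozen distribution with (n, 1) args
distr_col = stats.poisson(mu[:, None])

distr_1d.mean().shape   # (3,)
distr_col.mean().shape  # (3, 1)
```

Every method of the frozen distribution inherits the parameter shape, which is why downstream code has to know which convention it gets.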

Not decided yet:
Do we add an option as_column=False to get_distribution?
For now, I commented out the old column code so I can reuse it if we add the option.
Update: not for now.

Related missing:
GLM and OLS need a results method (they only have model.get_distribution).

@josef-pkt
Member Author

josef-pkt commented Apr 7, 2023

No failures so far, which is either surprising or a sign of insufficient unit test coverage.

In discrete_model, only NegativeBinomial predict(which="prob") calls get_distribution; Poisson, GPP, and NBP use the distribution directly.
In count_model, the zero-inflated models never call get_distribution; _predict_prob also computes probabilities from scratch.

So, only NegativeBinomial predict(which="prob") could be wrong because of the shape change in get_distribution

(Now, I don't remember why I switched to column distribution if I didn't use it much internally in statsmodels.)

Two unit tests in discrete call get_distribution:
the check_distr function in test_discrete uses squeeze on all returns,
and test_distr in test_predict, which I had already checked before, was also designed to handle both shapes.
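For illustration, a minimal sketch of the squeeze workaround such tests can use so that both return conventions compare equal (an assumed pattern, not the actual test code):

```python
import numpy as np

expected = np.array([0.1, 0.2, 0.3])
ret_1d = np.array([0.1, 0.2, 0.3])   # 1-d return convention
ret_col = ret_1d[:, None]            # one-column 2-d return convention

# np.squeeze normalizes both shapes to 1-d before comparing
np.testing.assert_allclose(np.squeeze(ret_1d), expected)
np.testing.assert_allclose(np.squeeze(ret_col), expected)
```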

@josef-pkt
Member Author

I think I will not add work on RegressionModel/Results get_distribution to this PR.
AFAICS, weights handling in predict is only used in get_prediction, and it's outsourced to the module statsmodels.regression._prediction.

This means adding results.get_distribution will not be so simple, unless it calls get_prediction, but that's overkill. We only need weights handling, not standard errors.

The predict method currently only does mean prediction, the first moment, which does not need (var_)weights.

@josef-pkt
Member Author

refactoring bug

NegativeBinomial res.predict(which="prob") fails with a shape mismatch, and get_diagnostic plot_probs fails as a consequence.
There is no unit test for it. I had largely ignored the NB model in favor of NBP in recent changes and checks.

```python
probs = res.predict(which="prob", y_values=np.arange(5))
```

is also broken now; y_values needs to be a 2-d column if the get_distribution distr args are 1-d
(that's why I had used column distr args).

I can fix this by reshaping y_values internally for which="prob", but users will need to reshape if they want vectorized evaluation of distr methods, e.g. cdf at different values.
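The broadcasting issue can be sketched with scipy.stats.poisson directly (an illustration, not the statsmodels internals): with column distr args, a 1-d y_values broadcasts automatically; with 1-d args, the caller has to supply the column.

```python
import numpy as np
from scipy import stats

mu = np.array([1.0, 2.0, 3.0])   # distr args, one per observation
y_values = np.arange(5)

# old column convention: (3, 1) args broadcast against 1-d y_values
probs_col = stats.poisson(mu[:, None]).pmf(y_values)     # shape (3, 5)

# new 1-d convention: y_values must be reshaped to a column by the caller
probs_1d = stats.poisson(mu).pmf(y_values[:, None]).T    # shape (3, 5)

np.testing.assert_allclose(probs_col, probs_1d)
```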

@josef-pkt
Member Author

Something weird with DeltaCov if params has only one element.

This works:

```python
res = smd.Poisson(y, x[:, :2]).fit()
res.get_prediction(which="prob", average=True).summary_frame()
```

but using one-column exog raises an exception in get_prediction:

```python
res = smd.Poisson(y, x[:, :1]).fit()
```

I'm in the middle of making some changes for which="prob", so no clean code or example.

NB, NBP, and GPP don't fail (they have one extra parameter, so len(params) == 2), but the one-parameter NB-geometric fails in the same way.

AFAIR, we don't have another get_prediction case that has multivariate prediction,
and I guess I never checked the case with only one parameter in a model before.
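A plausible mechanism for the failure (a sketch of a generic delta-method computation, not the actual DeltaCov code): the covariance of the predictions needs the Jacobian as a 2-d (m, k) array, and a squeezed 1-d gradient breaks the matrix products when k == 1.

```python
import numpy as np

cov_params = np.array([[0.04]])           # 1x1 covariance, single parameter
grad_2d = np.arange(1.0, 6.0)[:, None]    # Jacobian with expected shape (5, 1)
grad_1d = grad_2d.ravel()                 # (5,), the squeezed shape

# delta method: cov of predictions = G @ cov(params) @ G.T
cov_pred = grad_2d @ cov_params @ grad_2d.T   # works, shape (5, 5)

try:
    bad = grad_1d @ cov_params @ grad_1d.T    # shape mismatch
except ValueError:
    bad = None
```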

@josef-pkt
Member Author

josef-pkt commented Apr 13, 2023

Likely reason for the problem in DeltaCov with 1-element params:

approx_fprime returns a 1-d gradient if len(params) == 1.

```python
import numpy as np
from statsmodels.tools.numdiff import approx_fprime_cs, approx_fprime

x = np.arange(5)

def fun(params):
    return params * x

params = np.asarray([1])
jac = approx_fprime_cs(params, fun)
jac2 = approx_fprime(params, fun)
jac.shape, jac2.shape
# ((5, 1), (5,))
```

For len(params) == 2:

```python
x = np.arange(10).reshape(-1, 2)

def fun(params):
    return (params * x).sum(-1)

params = np.asarray([1, 2])
jac = approx_fprime_cs(params, fun)
jac2 = approx_fprime(params, fun)
jac.shape, jac2.shape
# ((5, 2), (5, 2))
```

Update:
I indirectly confirmed that approx_fprime is the problem.
I changed Poisson.predict for which="prob" to use _pmf instead of pmf; stats.poisson._pmf does not raise an exception if the Poisson parameter is complex. In that case DeltaCov uses approx_fprime_cs, and res.get_prediction(which="prob", average=True) no longer fails when len(params) is 1.

I cannot do the same for NB-geometric because it uses get_distribution for predict which="prob", and scipy frozen distributions do not have the "private" methods like _pmf.

If we want to use complex-step derivatives for predict/get_prediction, then we cannot go through the "public" methods of scipy distributions, i.e. we need to implement it without going through get_distribution.
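For context, the complex-step trick that approx_fprime_cs relies on looks like this (a generic sketch; it only works if the function propagates complex input untouched, which scipy's public frozen-distribution methods do not guarantee because of argument validation and casting):

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    # f'(x) ~= Im(f(x + i*h)) / h; accurate to near machine precision
    # because there is no subtractive cancellation.
    return np.imag(f(x + 1j * h)) / h

d = complex_step_derivative(np.exp, 0.5)
# d is essentially exp(0.5), with error near machine epsilon
```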

@josef-pkt
Member Author

josef-pkt commented Apr 13, 2023

When I change the return of approx_fprime to skip the squeeze if len(params) is 1,

```python
    if n == 1:
        return grad.T
    else:
        return grad.squeeze().T
```

then I get unit test failures (run only in discrete.tests) where we now get a one-column array instead of a 1-d array.

test_conditional directly compares the approx_fprime result with the score.
The truncated tests fail with a shape mismatch in the unit test for get_prediction which="prob-base".

```
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_conditional.py::test_logit_1d
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_conditional.py::test_poisson_1d
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_truncated_model.py::TestTruncatedLFPoissonSt::test_predict
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_truncated_model.py::TestTruncatedLFPoisson1St::test_predict

        pred = res1.get_prediction(which="prob-base", average=True)
        assert_allclose(pred.predicted[:k], rdf[:-1, 0], rtol=5e-5)
>       assert_allclose(pred.se[:k], rdf[:-1, 1],
                        rtol=8e-4, atol=1e-10)
E       AssertionError:
E       Not equal to tolerance rtol=0.0008, atol=1e-10
E
E       (shapes (4, 1), (4,) mismatch)
E        x: array([[5.905856e-05],
E              [3.022208e-04],
E              [7.646469e-04],
E              [1.265859e-03]])
E        y: array([5.905857e-05, 3.022208e-04, 7.646470e-04, 1.265859e-03])
```

The truncated failures were a consequence of Poisson having returned which="prob" in the wrong shape (?).
Dropping a [:, None] makes the failures go away.

@josef-pkt
Member Author

This PR now includes the fix for approx_fprime return shape.

The PR should be fine now, but I don't have a good overview of which other parts might be affected.

@josef-pkt
Member Author

Finally, main and the PR are all green, except for pre-testing.

@josef-pkt josef-pkt merged commit 4afdb22 into statsmodels:main Apr 13, 2023
6 checks passed
@bashtage bashtage added this to the 0.14 milestone Apr 14, 2023
Successfully merging this pull request may close these issues.

BUG: BetaModel, df_xxx warning, and init kwargs