
REF: get_distribution, return 1-d instead of column frozen distribution #8780

Merged (4 commits) on Apr 13, 2023

Conversation

josef-pkt
Member

@josef-pkt josef-pkt commented Apr 7, 2023

#8723

This also now warns on invalid __init__ kwargs in BetaModel; closes #8779

I changed get_distribution in discrete to return 1-d instead of 2-d.
I expected some unit tests to fail, but they didn't.
Some tests don't fail because they are written to handle both 1-d and one-column 2-d returns; they work around the current inconsistency in return shape.
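To make the shape difference concrete, here is a minimal sketch using scipy.stats.poisson directly as a stand-in for what get_distribution builds (an illustration, not statsmodels code):

```python
import numpy as np
from scipy import stats

mu = np.array([1.0, 2.0, 3.0])  # predicted means, one per observation

# 1-d convention (new behavior): frozen distribution with 1-d args
distr_1d = stats.poisson(mu)
# column convention (old behavior): frozen distribution with (n, 1) args
distr_col = stats.poisson(mu[:, None])

distr_1d.mean().shape   # (3,)
distr_col.mean().shape  # (3, 1)
```

Every method of the frozen distribution inherits the parameter shape, which is why downstream code has to know which convention it gets.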

Not decided yet:
Do we add an option as_column=False to get_distribution?
For now, I commented out the old column code so I can reuse it if we add the option.
Update: not for now.

Related missing:
GLM and OLS need a results method (they only have model.get_distribution).

@josef-pkt
Member Author

josef-pkt commented Apr 7, 2023

No failures so far, which is either surprising or a sign of insufficient unit test coverage.

In discrete_model, only NegativeBinomial predict(which="prob") calls get_distribution; Poisson, GPP, and NBP use the distribution directly.
In count_model, the zero-inflated models never call get_distribution; _predict_prob also computes probabilities from scratch.

So, only NegativeBinomial predict(which="prob") could be wrong because of the shape change in get_distribution

(Now, I don't remember why I switched to column distribution if I didn't use it much internally in statsmodels.)

Two unit tests in discrete call get_distribution:
the check_distr function in test_discrete uses squeeze on all returns,
and test_distr in test_predict, which I had already checked before, was also designed to handle both shapes.
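For illustration, a minimal sketch of the squeeze workaround such tests can use so that both return conventions compare equal (an assumed pattern, not the actual test code):

```python
import numpy as np

expected = np.array([0.1, 0.2, 0.3])
ret_1d = np.array([0.1, 0.2, 0.3])   # 1-d return convention
ret_col = ret_1d[:, None]            # one-column 2-d return convention

# np.squeeze normalizes both shapes to 1-d before comparing
np.testing.assert_allclose(np.squeeze(ret_1d), expected)
np.testing.assert_allclose(np.squeeze(ret_col), expected)
```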

@josef-pkt
Member Author

I think I will not add work on RegressionModel/Results get_distribution to this PR.
AFAICS, weights handling in predict is only used in get_prediction, and it's outsourced to the module statsmodels.regression._prediction.

This means adding results.get_distribution will not be so simple, unless it calls get_prediction, but that's overkill. We only need weights handling, not standard errors.

The predict method currently only does mean prediction, the first moment, which does not need (var_)weights.

@josef-pkt
Member Author

refactoring bug

NegativeBinomial res.predict(which="prob") fails with a shape mismatch, and get_diagnostic plot_probs fails as a consequence.
There is no unit test for it. I had largely ignored the NB model in favor of NBP in recent changes and checks.

```python
probs = res.predict(which="prob", y_values=np.arange(5))
```

is also broken now; y_values needs to be a 2-d column if the get_distribution distr args are 1-d
(that's why I had used column distr args).

I can fix this by reshaping y_values internally for which="prob", but users will need to reshape if they want vectorized evaluation of distr methods, e.g. cdf at different values.
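The broadcasting issue can be sketched with scipy.stats.poisson directly (an illustration, not the statsmodels internals): with column distr args, a 1-d y_values broadcasts automatically; with 1-d args, the caller has to supply the column.

```python
import numpy as np
from scipy import stats

mu = np.array([1.0, 2.0, 3.0])   # distr args, one per observation
y_values = np.arange(5)

# old column convention: (3, 1) args broadcast against 1-d y_values
probs_col = stats.poisson(mu[:, None]).pmf(y_values)     # shape (3, 5)

# new 1-d convention: y_values must be reshaped to a column by the caller
probs_1d = stats.poisson(mu).pmf(y_values[:, None]).T    # shape (3, 5)

np.testing.assert_allclose(probs_col, probs_1d)
```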

@josef-pkt
Member Author

Something weird with DeltaCov if params has only one element.

This works:

```python
res = smd.Poisson(y, x[:, :2]).fit()
res.get_prediction(which="prob", average=True).summary_frame()
```

but using one-column exog raises an exception in get_prediction:

```python
res = smd.Poisson(y, x[:, :1]).fit()
```

I'm in the middle of making some changes for which="prob", so no clean code or example.

NB, NBP, and GPP don't fail (they have one extra parameter, so len(params) == 2), but the one-parameter NB-geometric fails in the same way.

AFAIR, we don't have another get_prediction case that has multivariate prediction,
and I guess I never checked the case with only one parameter in a model before.
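A plausible mechanism for the failure (a sketch of a generic delta-method computation, not the actual DeltaCov code): the covariance of the predictions needs the Jacobian as a 2-d (m, k) array, and a squeezed 1-d gradient breaks the matrix products when k == 1.

```python
import numpy as np

cov_params = np.array([[0.04]])           # 1x1 covariance, single parameter
grad_2d = np.arange(1.0, 6.0)[:, None]    # Jacobian with expected shape (5, 1)
grad_1d = grad_2d.ravel()                 # (5,), the squeezed shape

# delta method: cov of predictions = G @ cov(params) @ G.T
cov_pred = grad_2d @ cov_params @ grad_2d.T   # works, shape (5, 5)

try:
    bad = grad_1d @ cov_params @ grad_1d.T    # shape mismatch
except ValueError:
    bad = None
```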

@josef-pkt
Member Author

josef-pkt commented Apr 13, 2023

Likely reason for the problem in DeltaCov with 1-element params:

approx_fprime returns a 1-d gradient if len(params) == 1.

```python
import numpy as np
from statsmodels.tools.numdiff import approx_fprime_cs, approx_fprime

x = np.arange(5)

def fun(params):
    return params * x

params = np.asarray([1])
jac = approx_fprime_cs(params, fun)
jac2 = approx_fprime(params, fun)
jac.shape, jac2.shape
# ((5, 1), (5,))
```

For len(params) == 2:

```python
x = np.arange(10).reshape(-1, 2)

def fun(params):
    return (params * x).sum(-1)

params = np.asarray([1, 2])
jac = approx_fprime_cs(params, fun)
jac2 = approx_fprime(params, fun)
jac.shape, jac2.shape
# ((5, 2), (5, 2))
```

Update:
I indirectly confirmed that approx_fprime is the problem.
I changed Poisson.predict for which="prob" to use _pmf instead of pmf; stats.poisson._pmf does not raise an exception if the Poisson parameter is complex. In that case DeltaCov uses approx_fprime_cs, and res.get_prediction(which="prob", average=True) no longer fails when len(params) is 1.

I cannot do the same for NB-geometric because it uses get_distribution for predict which="prob", and scipy frozen distributions do not have the "private" methods like _pmf.

If we want to use complex-step derivatives for predict/get_prediction, then we cannot go through the "public" methods of scipy distributions, i.e. we need to implement it without going through get_distribution.
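For context, the complex-step trick that approx_fprime_cs relies on looks like this (a generic sketch; it only works if the function propagates complex input untouched, which scipy's public frozen-distribution methods do not guarantee because of argument validation and casting):

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    # f'(x) ~= Im(f(x + i*h)) / h; accurate to near machine precision
    # because there is no subtractive cancellation.
    return np.imag(f(x + 1j * h)) / h

d = complex_step_derivative(np.exp, 0.5)
# d is essentially exp(0.5), with error near machine epsilon
```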

@josef-pkt
Member Author

josef-pkt commented Apr 13, 2023

When I change the return of approx_fprime to skip the squeeze if len(params) is 1,

```python
    if n == 1:
        return grad.T
    else:
        return grad.squeeze().T
```

then I get unit test failures (run only in discrete.tests) where we now get a one-column array instead of a 1-d array.

test_conditional directly compares the approx_fprime result with the score.
The truncated tests fail with a shape mismatch in the unit test for get_prediction which="prob-base".

```
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_conditional.py::test_logit_1d
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_conditional.py::test_poisson_1d
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_truncated_model.py::TestTruncatedLFPoissonSt::test_predict
FAILED ..\eclipse-workspace\statsmodels_gh\statsmodels\statsmodels\discrete\tests\test_truncated_model.py::TestTruncatedLFPoisson1St::test_predict

        pred = res1.get_prediction(which="prob-base", average=True)
        assert_allclose(pred.predicted[:k], rdf[:-1, 0], rtol=5e-5)
>       assert_allclose(pred.se[:k], rdf[:-1, 1],
                        rtol=8e-4, atol=1e-10)
E       AssertionError:
E       Not equal to tolerance rtol=0.0008, atol=1e-10
E
E       (shapes (4, 1), (4,) mismatch)
E        x: array([[5.905856e-05],
E              [3.022208e-04],
E              [7.646469e-04],
E              [1.265859e-03]])
E        y: array([5.905857e-05, 3.022208e-04, 7.646470e-04, 1.265859e-03])
```

The truncated failures were a consequence of Poisson having returned which="prob" in the wrong shape (?).
Dropping a [:, None] makes the failures go away.

@josef-pkt
Member Author

This PR now includes the fix for approx_fprime return shape.

The PR should be fine now, but I don't have a good overview of which other parts might be affected.

@josef-pkt
Member Author

Finally, main and the PR are all green, except for pre-testing.

@josef-pkt josef-pkt merged commit 4afdb22 into statsmodels:main Apr 13, 2023
6 checks passed
@bashtage bashtage added this to the 0.14 milestone Apr 14, 2023
Successfully merging this pull request may close these issues.

BUG: BetaModel, df_xxx warning, and init kwargs