
Distributions are not independent for Empirical #1776

Open

sfo opened this issue Dec 11, 2023 · 1 comment

Comments

sfo commented Dec 11, 2023

While the documentation states that

The first k dimensions index into a batch of independent distributions

this is not the case in the implementation. As can be seen from the following MWE, the samples drawn for each distribution in the batch are always identical:

import numpy as np
import tensorflow_probability as tfp

# Build a batch of two Empirical distributions from the same sample array.
a = np.random.randint(0, 100, size=1000)
dist = tfp.distributions.Empirical([a, a])

# Draw n samples; each column corresponds to one distribution in the batch.
n = 3
dist.sample(n)

# output
# <tf.Tensor: shape=(3, 2), dtype=int32, numpy=
# array([[66, 66],
#        [33, 33],
#        [98, 98]], dtype=int32)>

The root cause seems to be at this location, where indices are sampled only once and then reused for every distribution in the batch:

indices = samplers.uniform([n], maxval=self._compute_num_samples(samples),
                           dtype=tf.int32, seed=seed)
draws = tf.gather(samples, indices, axis=self._samples_axis)
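
To see why this couples the draws, consider a minimal standalone demonstration (the tensors here are illustrative, not TFP's internal state): because a single index vector is shared across the whole batch, the same positions are picked from every batch member.

import tensorflow as tf

# Batch of two "distributions", each backed by three stored samples.
samples = tf.constant([[10, 20, 30],
                       [40, 50, 60]])
# One index vector, drawn once and reused for both batch members.
indices = tf.constant([2, 0, 2])
tf.gather(samples, indices, axis=1)
# -> [[30, 10, 30],
#     [60, 40, 60]]  # identical positions picked from every row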

I now fear that there might be more issues in the implementation of Empirical, rendering it unusable in contexts where I require independence between distributions.

Also, the shape of the output seems strange: it is (n, k), while I would expect it to be (k, n).

Maybe I also completely misunderstood the documentation.

csuter (Member) commented Dec 11, 2023

Wow, yeah, that's a pretty egregious bug. Thank you for flagging. There's pretty good test coverage of this distribution, including, e.g., statistical tests of mean and variance, even in the presence of batch shapes. But nothing checks independence of the samples across batches! Should be a straightforward-ish fix (we'll need to sample indices for each batch dimension and change the gather to a gather_nd, which is always fun :)).

Are you interested in trying to submit a patch? No pressure, just don't want to duplicate work.
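
For reference, a minimal sketch of the fix described above, assuming a standalone (batch_size, num_samples) layout rather than TFP's actual internals (the helper name and signature are hypothetical): draw a separate index vector per batch member and use tf.gather_nd to pair each index with its batch coordinate.

import tensorflow as tf

def sample_empirical_batch(samples, n, seed=None):
    # Hypothetical helper, not TFP's implementation. `samples` is assumed
    # to have shape (batch_size, num_samples).
    batch_size, num_samples = samples.shape
    # Independent indices for every batch member: shape (n, batch_size).
    indices = tf.random.uniform([n, batch_size], maxval=num_samples,
                                dtype=tf.int32, seed=seed)
    # Pair each index with its batch coordinate, so gather_nd picks
    # element (batch_idx, index) for every draw.
    batch_idx = tf.broadcast_to(tf.range(batch_size), [n, batch_size])
    gather_idx = tf.stack([batch_idx, indices], axis=-1)  # (n, batch_size, 2)
    return tf.gather_nd(samples, gather_idx)              # (n, batch_size)

Applied to the MWE above (samples = tf.constant([a, a])), each column of the result is drawn with its own indices, so the columns no longer match element-for-element.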
