Dataset with incomplete combinations #76

Open
brunobian opened this issue Sep 19, 2023 · 0 comments


Hi! I am using this code to work on a dataset of ~700 words. For each word I vary several variables (size, font, position, etc.). The full set of combinations is too large (6M+ instances) to use during training, so I decided to train on a sample of the full dataset. That is fine for training, but it creates an issue during the evaluation run.

In particular, I found the first technical issue in evaluate.compute_metrics(). To run this method, the code tries to reshape the samples_zCx and params_zCx tensors using the sizes of the dataset generation factors (lat_sizes) and the latent layer size (latent_dim). This works when the dataset contains every possible combination of factors, but my sample has fewer instances than the product of lat_sizes, so the reshape fails.
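A minimal sketch of the size mismatch, using NumPy rather than the project's actual tensors. The factor sizes, latent size, and number of sampled instances below are small made-up values for illustration, not the ones from my dataset.

```python
import numpy as np

# Hypothetical sizes, chosen only to keep the example small.
lat_sizes = (3, 4, 2)   # number of values per generative factor
latent_dim = 5          # size of the latent layer
n_full = int(np.prod(lat_sizes))   # instances in the complete grid
n_sampled = 8                      # instances actually present in the sample

# Stand-in for samples_zCx: one latent vector per available instance.
samples_zCx = np.zeros((n_sampled, latent_dim))

# The reshape needs n_full * latent_dim elements, but only
# n_sampled * latent_dim are available, so NumPy raises ValueError.
try:
    samples_zCx.reshape(*lat_sizes, latent_dim)
    reshape_ok = True
except ValueError:
    reshape_ok = False

print(n_full, n_sampled, reshape_ok)
```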

I worked around this by creating a tensor filled with np.nan and writing the available data into the corresponding cells (using metadata from the dataset that indicates how each instance was generated). Technically this works, but I now have doubts about how this solution affects the subsequent calculations. That is, I now have a tensor with NaNs that will be used to compute the conditional entropy H(z|v). Is this OK? Would it be better to use zeros?
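The workaround can be sketched roughly as follows, again with NumPy and made-up sizes. `factor_idx` is a hypothetical name for the per-instance metadata (one integer index per generative factor); it is not an identifier from the repository. The sketch also shows the general difference between the two fill values: NaN propagates through a plain aggregation, while zeros are silently counted as if they were real observations. Whether either behaviour is acceptable inside the actual entropy estimator is exactly what I am unsure about.

```python
import numpy as np

lat_sizes = (3, 4, 2)   # hypothetical factor sizes
latent_dim = 5          # hypothetical latent size

# Hypothetical metadata: factor indices of the 8 instances in the sample.
flat = np.arange(0, 24, 3)                                        # 8 distinct cells
factor_idx = np.stack(np.unravel_index(flat, lat_sizes), axis=1)  # shape (8, 3)

# Dummy latent values (all ones) so the effect of the fill value is easy to see.
samples = np.ones((len(flat), latent_dim))

# Scatter the available instances into a NaN-filled grid.
nan_filled = np.full((*lat_sizes, latent_dim), np.nan)
nan_filled[tuple(factor_idx.T)] = samples

# Same grid, but with the missing cells set to zero instead.
zero_filled = np.nan_to_num(nan_filled, nan=0.0)

plain_mean = nan_filled.mean()            # NaN: missing cells propagate
zero_mean = zero_filled.mean()            # biased: zeros count as data
nan_aware_mean = np.nanmean(nan_filled)   # uses observed cells only

print(plain_mean, zero_mean, nan_aware_mean)
```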

Additionally, computing the conditional entropy with the _estimate_H_zCv() method is computationally expensive, given that I have a big tensor full of NaNs. Would it be OK to skip the cells with NaNs to speed up the process?
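To make the "skip the NaN cells" idea concrete, here is a small sketch with NumPy and made-up sizes. It does not reproduce _estimate_H_zCv(); it only illustrates, for a simple per-factor-value average, that restricting the computation to observed cells gives the same result as a NaN-aware aggregation over the padded grid, while touching far fewer elements. All names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
lat_sizes = (3, 4, 2)   # hypothetical factor sizes
latent_dim = 5          # hypothetical latent size

# Hypothetical sample: 8 of the 24 cells are observed.
flat = np.arange(0, 24, 3)
factor_idx = np.stack(np.unravel_index(flat, lat_sizes), axis=1)
samples = rng.normal(size=(len(flat), latent_dim))

# NaN-padded grid, as in the workaround.
full = np.full((*lat_sizes, latent_dim), np.nan)
full[tuple(factor_idx.T)] = samples

# Mask of cells that actually contain data, and the compact view of them.
observed = ~np.isnan(full).any(axis=-1)   # shape == lat_sizes
compact = full[observed]                  # shape (n_observed, latent_dim)
factor0 = np.nonzero(observed)[0]         # first-factor index of each observed cell

# Per-value average for the first factor, computed on observed cells only...
means_compact = np.stack(
    [compact[factor0 == v].mean(axis=0) for v in range(lat_sizes[0])]
)
# ...and the same quantity computed NaN-aware over the whole padded grid.
means_padded = np.nanmean(full, axis=(1, 2))

print(np.allclose(means_compact, means_padded))
```

This only holds when every value of the factor appears at least once in the sample, as it does in this toy example; a factor value with no observed instances has no defined average in either version.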
