Documents disappearing when using get_sparse_theta #976

Open
r0mainK opened this issue Aug 2, 2019 · 2 comments

Comments

r0mainK commented Aug 2, 2019

Hey!

So I've been using the ARTM model via the Python API to do some topic modeling, and ran into the following bug: after training the model offline for a couple of iterations, I often saw documents disappear from the theta matrix when retrieving it via the get_sparse_theta method. The missing documents were the same on each run (for the same seed) and had very low word counts.
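A minimal sketch of how the missing documents can be listed. It assumes you have the document ids you fed to the model and the ids that came back with the retrieved matrix, both as plain sequences; `missing_documents` and the sample ids are hypothetical names used only for illustration.

```python
def missing_documents(expected_ids, returned_ids):
    """Return the expected document ids that are absent from the retrieved theta.

    The order of `expected_ids` is preserved so that repeated runs can be compared.
    """
    returned = set(returned_ids)
    return [doc_id for doc_id in expected_ids if doc_id not in returned]


# Hypothetical ids: four documents went in, only two came back.
expected = ["doc_0", "doc_1", "doc_2", "doc_3"]
returned = ["doc_0", "doc_2"]
missing = missing_documents(expected, returned)
print(missing)
```

Running this after each training pass and comparing the lists makes it easy to confirm that the same documents disappear for a given seed.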

Furthermore, I saw that the number of returned documents would sometimes increase after decreasing, implying the data was still there but not being returned. I was able to get rid of this problem by retrieving the dense matrix directly: providing the theta_name argument to the model's constructor stores theta as a phi matrix, which can then be retrieved in dense form. As this workaround solved the issue for me, I will not be looking further into this, but I thought you might want to know. There are more details in our repo's tracking issue if you want to check it out. The gist of it is that there is almost certainly a problem when retrieving data as a sparse matrix.

Anyway, cheers

Contributor

ofrei commented Aug 7, 2019

@r0mainK thanks for reporting this and for linking to more details in your repository. I'll put it on our list to implement additional tests for sparse retrieval of theta and phi, and to check consistency with retrieval of the dense matrix.
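A consistency check of that kind could look roughly like the sketch below. It is not taken from the BigARTM test suite: it assumes the sparse result has been unpacked into coordinate triplets (row index, column index, value) and the dense result into a list of rows, and the names `coo_to_dense` and `find_mismatches` are hypothetical.

```python
import math


def coo_to_dense(n_rows, n_cols, rows, cols, values):
    """Rebuild a dense matrix (list of rows) from coordinate triplets."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, v in zip(rows, cols, values):
        dense[r][c] += v
    return dense


def find_mismatches(dense, rows, cols, values, eps=0.0, tol=1e-6):
    """Return (row, col, dense_value, sparse_value) for every disagreeing cell.

    Dense entries at or below `eps` are expected to be absent from the
    sparse result, so they are compared against zero.
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    rebuilt = coo_to_dense(n_rows, n_cols, rows, cols, values)
    mismatches = []
    for r in range(n_rows):
        for c in range(n_cols):
            expected = dense[r][c] if dense[r][c] > eps else 0.0
            if not math.isclose(expected, rebuilt[r][c], abs_tol=tol):
                mismatches.append((r, c, dense[r][c], rebuilt[r][c]))
    return mismatches


# Toy data: 2 topics x 3 documents; the sparse result lost document 1 entirely.
dense_theta = [[0.7, 0.0, 0.5],
               [0.3, 1.0, 0.5]]
sparse_rows = [0, 1, 0, 1]
sparse_cols = [0, 0, 2, 2]
sparse_vals = [0.7, 0.3, 0.5, 0.5]
mismatches = find_mismatches(dense_theta, sparse_rows, sparse_cols, sparse_vals)
print(mismatches)
```

An empty list means both retrieval paths agree; any entry points at the topic and document where they diverge.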

Author

r0mainK commented Aug 15, 2019

Update

I came across a related bug recently. I had not checked the sanity of the theta matrices when retrieving them, and it seems the workaround actually just created null document rows. However, when retrieving the theta matrix with transform_sparse on the initial vectorizer, the shape and contents of the matrix were often correct, though not always.
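The sanity check in question can be sketched as follows, assuming theta has been converted to a mapping from document id to its list of topic weights, and assuming a healthy document's weights should sum to roughly 1. The function name `classify_documents` and the sample data are hypothetical.

```python
def classify_documents(theta_by_doc, tol=1e-6):
    """Split suspicious documents into all-zero ones and unnormalized ones.

    Returns (null_docs, unnormalized_docs), each a list of document ids
    in the iteration order of `theta_by_doc`.
    """
    null_docs, unnormalized_docs = [], []
    for doc_id, weights in theta_by_doc.items():
        if all(abs(w) <= tol for w in weights):
            # Every topic weight is (numerically) zero: a null document.
            null_docs.append(doc_id)
        elif abs(sum(weights) - 1.0) > tol:
            # Non-zero weights that do not form a probability distribution.
            unnormalized_docs.append(doc_id)
    return null_docs, unnormalized_docs


# Toy theta with one healthy, one null, and one unnormalized document.
theta = {
    "doc_0": [0.25, 0.75],
    "doc_1": [0.0, 0.0],
    "doc_2": [0.5, 0.25],
}
null_docs, unnormalized_docs = classify_documents(theta)
print(null_docs, unnormalized_docs)
```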

I also noticed that the bug seemed to appear when inducing sparsity via the SmoothSparseThetaRegularizer, either used alone or in combination with other regularizers.
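One possible mechanism, stated as an unverified assumption rather than a diagnosis: if the regularized update for a document takes the form "add a constant to each topic count, clip at zero, normalize", then a negative (sparsing) constant can wipe out every topic of a document with very few words. The toy function below illustrates only that arithmetic; it is not BigARTM code, and `regularized_theta_column` and the numbers are hypothetical.

```python
def regularized_theta_column(topic_counts, tau):
    """Toy update for one document: normalize max(n_td + tau, 0) over topics.

    Returns an all-zero column when every shifted count is clipped to zero.
    """
    shifted = [max(n + tau, 0.0) for n in topic_counts]
    total = sum(shifted)
    if total == 0.0:
        # Nothing survived the clipping, so there is nothing to normalize.
        return [0.0] * len(topic_counts)
    return [value / total for value in shifted]


# A long document keeps a valid distribution under a sparsing tau ...
long_doc = regularized_theta_column([40.0, 10.0], tau=-1.5)
# ... while a two-word document loses all of its topics.
short_doc = regularized_theta_column([1.0, 1.0], tau=-1.5)
print(long_doc, short_doc)
```

If something like this is happening, the affected documents would be exactly the ones with very low word counts, and they would be the same on every run with a fixed seed.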

For more information you can check this issue

Cheers.

EDIT: forgot to mention this earlier, but we are using the latest tagged version (0.10.0), built following your guide in a Docker instance based off ubuntu:18.04.
