Documents disappearing when using get_sparse_theta #976

r0mainK · 2019-08-02T12:40:04Z

Hey!

So I've been using the ARTM model via the Python API to do some topic modeling, and ran into the following bug: after training the model offline for a couple of iterations, I often saw documents disappear from the theta matrix when retrieving it via the get_sparse_theta method. The documents in question were the same at each run (for the same seed) and had very low word counts.

Furthermore, I saw that the number of documents would sometimes increase after decreasing, implying the data was still there but not being returned. I was able to get rid of this problem by retrieving the dense matrix directly: I stored it in a phi matrix by providing the theta_name argument to the model's constructor. As this workaround solved the issue for me, I will not be looking further into this, but I thought you might want to know. There are more details in our repo's tracking issue if you want to check it out. The gist of it is that there is almost certainly a problem when retrieving the data as a sparse matrix.

Anyway, cheers

ofrei · 2019-08-07T11:33:18Z

@r0mainK thanks for reporting this and for linking to more details in your repository. I'll put it on our list to implement additional tests for sparse retrieval of theta and phi, and to check consistency with retrieval of the dense matrix.
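
A consistency check of this kind could look roughly like the sketch below. It is only an illustration and does not call the library: it assumes the dense theta is available as a topics-by-documents list of rows and the sparse result as (topic index, document index, value) triplets. The function name `check_sparse_against_dense` and both input layouts are hypothetical.

```python
def check_sparse_against_dense(dense, triplets, tol=1e-9):
    """Compare a dense matrix with a sparse triplet representation.

    dense:    list of rows, one row per topic, one column per document
              (assumed layout, not taken from the library).
    triplets: iterable of (topic_idx, doc_idx, value) entries.

    Returns (missing_docs, mismatches):
      missing_docs: documents with non-zero dense weights that have
                    no entry at all in the sparse representation.
      mismatches:   (topic_idx, doc_idx) cells whose values differ.
    """
    num_docs = len(dense[0]) if dense else 0
    sparse_values = {(t, d): v for t, d, v in triplets}
    docs_in_sparse = {d for _, d in sparse_values}

    missing_docs = sorted(
        d
        for d in range(num_docs)
        if d not in docs_in_sparse
        and any(abs(row[d]) > tol for row in dense)
    )
    # Cells absent from the sparse representation are treated as 0.0.
    mismatches = [
        (t, d)
        for t, row in enumerate(dense)
        for d, value in enumerate(row)
        if abs(value - sparse_values.get((t, d), 0.0)) > tol
    ]
    return missing_docs, mismatches
```

An empty pair of lists means the two representations agree; a non-empty `missing_docs` corresponds to the "disappearing documents" symptom described above.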

r0mainK · 2019-08-15T08:13:08Z

Update

I came across a related bug recently. I had not checked the sanity of the theta matrices when retrieving them, and it seems the workaround actually just created null document rows. However, when retrieving the theta matrix with transform_sparse on the initial vectorizer, the shape and contents of the matrix were often correct - not always, but still.
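
A sanity check for such null documents can be sketched as follows. This is an illustration under stated assumptions, not the check used in the linked issue: theta is assumed to have been converted to a plain mapping from document id to a list of topic weights, and `find_null_documents` is a hypothetical helper name.

```python
def find_null_documents(theta_by_doc, tol=1e-12):
    """Return the ids of documents whose topic weights are all zero.

    theta_by_doc: dict mapping a document id to its list of topic
                  weights (assumed layout for this sketch).
    """
    return [
        doc_id
        for doc_id, weights in theta_by_doc.items()
        if all(abs(w) <= tol for w in weights)
    ]


# Usage: "doc_b" has no weight on any topic, so it is reported as null.
theta = {
    "doc_a": [0.7, 0.3, 0.0],
    "doc_b": [0.0, 0.0, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
null_docs = find_null_documents(theta)
```

Running this right after retrieving theta makes it visible whether documents were dropped outright or returned as empty rows.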

I also noticed that the bug seemed to appear when inducing sparsity via the SmoothSparseThetaRegularizer, either used alone or in combination with other regularizers.

For more information you can check this issue

Cheers.

EDIT: forgot to mention this earlier, but we are using the latest tagged version (0.10.0), built following your guide in a docker instance based off ubuntu:18.04
