So I've been using the `ARTM` model via the Python API to do some topic modeling, and ran into the following bug: after training the model offline for a couple of iterations, I often saw documents disappear from theta when retrieving it via the `get_sparse_theta` method. The affected documents were the same on each run (for the same seed) and had very low word counts.
Furthermore, I saw that the number of returned documents would sometimes increase after decreasing, implying the data was still there but not being returned. I was able to get rid of this problem by retrieving the dense matrix directly: I stored theta as a phi matrix by providing the `theta_name` argument to the model's constructor. As this workaround solved the issue for me, I will not be looking further into this, but I thought you might want to know. There are more details in our repo's tracking issue if you want to check it out - the gist of it is that there is almost certainly a problem when retrieving data as a sparse matrix.
Anyway, cheers
@r0mainK thanks for reporting this and for linking to more details in your repository. I'll put it on our list to implement additional tests for sparse retrieval of theta and phi, and to check their consistency with retrieval of the dense matrix.
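For what it's worth, the kind of consistency check mentioned above could look roughly like the sketch below. It is plain Python and does not call the library at all: `missing_documents` is a hypothetical helper, and the inputs (a dense topics-by-documents matrix as nested lists, plus the sparse result as `(topic, document, value)` triples) are assumptions about how the two retrievals might be laid out once exported.

```python
def missing_documents(dense_theta, sparse_entries, eps=1e-12):
    """Return indices of documents present in the dense theta but absent
    from the sparse one.

    dense_theta:    list of rows, one per topic; each row has one value
                    per document (assumed topics-by-documents layout).
    sparse_entries: iterable of (topic_index, doc_index, value) triples
                    (hypothetical export of the sparse matrix).
    """
    num_docs = len(dense_theta[0]) if dense_theta else 0

    # Documents that have at least one non-negligible entry in the sparse form.
    docs_in_sparse = {
        doc for _topic, doc, value in sparse_entries if abs(value) > eps
    }

    missing = []
    for doc in range(num_docs):
        # Total probability mass the dense matrix assigns to this document.
        column_mass = sum(abs(row[doc]) for row in dense_theta)
        if column_mass > eps and doc not in docs_in_sparse:
            missing.append(doc)
    return missing


# Toy data: 2 topics x 3 documents; document 1 is dropped from the sparse form.
dense = [
    [0.7, 0.0, 0.5],
    [0.3, 1.0, 0.5],
]
sparse = [(0, 0, 0.7), (1, 0, 0.3), (0, 2, 0.5), (1, 2, 0.5)]

print(missing_documents(dense, sparse))
```

An empty list would mean both retrievals cover the same documents; anything else points at documents that were silently dropped by the sparse path.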
I came across a related bug recently. I had not checked the sanity of the theta matrices after retrieving them, and it seems this fix actually just created null document rows. However, when retrieving the theta matrix with `transform_sparse` on the initial vectorizer, the shape and contents of the matrix were often correct, though not always.
I also noticed that the bug seemed to appear when inducing sparsity via the `SmoothSparseThetaRegularizer`, used either alone or in combination with other regularizers.
EDIT: forgot to mention this earlier, but we are using the latest tagged version (0.10.0), built following your guide in a Docker container based on ubuntu:18.04
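In case it helps anyone reproduce this, below is a minimal sketch of the sanity check that would catch the null document rows. It assumes the theta matrix has already been converted to a documents-by-topics list of lists (one row per document); `find_bad_rows` and the tolerance are hypothetical and not part of the library.

```python
import math


def find_bad_rows(doc_topic_rows, tol=1e-6):
    """Split suspicious document rows into two groups.

    Returns (null_rows, unnormalized_rows):
      null_rows         rows whose entries are all zero
      unnormalized_rows non-null rows whose entries do not sum to 1
    """
    null_rows, unnormalized_rows = [], []
    for index, row in enumerate(doc_topic_rows):
        if all(value == 0 for value in row):
            null_rows.append(index)
        elif not math.isclose(sum(row), 1.0, abs_tol=tol):
            unnormalized_rows.append(index)
    return null_rows, unnormalized_rows


# Toy data: row 1 is a null document, row 2 does not sum to 1.
rows = [
    [0.2, 0.8],
    [0.0, 0.0],
    [0.5, 0.4],
]
null_rows, unnormalized_rows = find_bad_rows(rows)
print("null rows:", null_rows)
print("unnormalized rows:", unnormalized_rows)
```

Running this right after each retrieval would flag both problems at once, instead of only noticing a changed document count.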