model = artm.ARTM(
    class_ids=modalities_list,
    num_topics=num_topics[1],
    scores=scores,
    topic_names=[
        "topic" + str(t) for t in range(num_topics[1])
    ],
    cache_theta=True,
)
model.num_document_passes = 1
model.initialize(dictionary)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=1)

num_topics_child = num_topics[2]  # 30
topic_names_child = ['topic_{}'.format(i) for i in range(num_topics_child)]
child_model = artm.ARTM(
    seed=1,
    topic_names=topic_names_child,
    cache_theta=True,
    class_ids=modalities_list,
    parent_model=model,  # when specified get_theta has wrong number of columns
    parent_model_weight=2,
    dictionary=dictionary,
    num_document_passes=1,
    # theta_name='child_theta'
)
child_model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=90)
For 41 documents, theta is supposed to be 30x41, but a 30x51 matrix (41 + 10 = 51) is returned instead.
The initial column names in theta contain only 0 (zeroes), so they look like 0 0 0 0 0 0 0 0 0 0 0 0, and then come the real columns 1, 2, 3, … More of these zero columns are appended to the end as num_collection_passes increases, so with num_collection_passes=90 the theta shape is 30 x 941.
print(child_model.get_theta())
The BigARTM documentation says that theta is supposed to be number of topics x number of documents.
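The two shapes reported above (30 x 51 and 30 x 941) are both consistent with 10 extra columns being added per collection pass. Here is a small arithmetic check of that reading. The helper name `expected_theta_columns` is made up for illustration, and the formula is only inferred from these two observations, not taken from the BigARTM documentation.

```python
def expected_theta_columns(num_documents, extra_per_pass, num_collection_passes):
    """Column count implied by the shapes reported in this issue.

    Assumption: `extra_per_pass` extra columns appear for every collection pass.
    This is inferred from two observations only, not from documented behaviour.
    """
    return num_documents + extra_per_pass * num_collection_passes


# 41 documents, 10 extra columns per pass
print(expected_theta_columns(41, 10, 1))   # 51
print(expected_theta_columns(41, 10, 90))  # 941
```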
NikolayPavlychev changed the title from "Problem with shape Child model theta matrix" to "Problem with shape of Child model theta matrix" on Nov 18, 2020
This is expected behaviour for hierarchical models. In this case, ARTM adds 'pseudodocuments' to the collection and tries to factorize the new, extended collection. Each pseudodocument is related to some topic of the parent model (since the distributions p(subtopic|doc) and p(subtopic|parent_topic) are computationally similar). If you run child_model.get_parent_psi(), the Psi matrix you obtain should consist of precisely the "extra" columns of Theta.
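If only the per-document part of Theta is needed, the pseudodocument columns can be filtered out by label. Below is a minimal sketch, assuming (as reported in the question) that pseudodocument columns are labelled 0 and real documents are labelled 1, 2, 3, … The helper `split_theta_columns` is hypothetical and not part of the artm API, and it would misclassify a real document whose id is 0.

```python
def split_theta_columns(column_labels, pseudo_label=0):
    """Return (real_positions, pseudo_positions) for a list of Theta column labels.

    Assumption: every pseudodocument column carries the label `pseudo_label`.
    """
    real, pseudo = [], []
    for pos, label in enumerate(column_labels):
        (pseudo if label == pseudo_label else real).append(pos)
    return real, pseudo


# Stand-in labels mimicking the report: 10 columns named 0, then documents 1..41.
labels = [0] * 10 + list(range(1, 42))
real_pos, pseudo_pos = split_theta_columns(labels)
print(len(real_pos), len(pseudo_pos))  # 41 10
```

With a fitted model, and assuming get_theta() returns a pandas DataFrame, the positions could be applied as `theta = child_model.get_theta()` followed by `theta.iloc[:, real_pos]` to keep the 30 x 41 document part.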
Also, for convergence/memory efficiency reasons, the following pattern is suggested: