Problem with shape of Child model theta matrix #1038

Open
NikolayPavlychev opened this issue Nov 18, 2020 · 1 comment

import artm

# modalities_list, num_topics, scores, dictionary and batch_vectorizer are defined elsewhere.

# Parent (upper-level) model.
model = artm.ARTM(
    class_ids=modalities_list,
    num_topics=num_topics[1],
    scores=scores,
    topic_names=[
        "topic" + str(t) for t in range(num_topics[1])
    ],
    cache_theta=True,
)
model.num_document_passes = 1
model.initialize(dictionary)

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=1)

num_topics_child = num_topics[2]  # 30
topic_names_child = ['topic_{}'.format(i) for i in range(num_topics_child)]

# Child (lower-level) model attached to the parent.
child_model = artm.ARTM(
    seed=1,
    topic_names=topic_names_child,
    cache_theta=True,
    class_ids=modalities_list,
    parent_model=model,  # when specified, get_theta has the wrong number of columns
    parent_model_weight=2,
    dictionary=dictionary,
    num_document_passes=1,
    # theta_name='child_theta'
)

child_model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=90)

For 41 documents, Theta is supposed to be 30 x 41, but a 30 x 51 matrix (41 + 10 = 51 columns) is returned instead.
The first column names in Theta contain only 0 (zeroes), so the header looks like 0 0 0 0 0 0 0 0 0 0 0 0, and then the real columns 1, 2, 3, … follow. More of those zero columns are appended to the end as num_collection_passes increases, so with num_collection_passes=90 the Theta shape is 30 x 941.
print(child_model.get_theta())
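
The two shapes reported above are consistent with one block of extra columns per collection pass, each block as wide as the parent model's topic count. A minimal arithmetic sketch of that reading, assuming the parent model has 10 topics (inferred from the 41 + 10 figure, not stated explicitly); `expected_theta_columns` is a hypothetical helper, not part of BigARTM:

```python
def expected_theta_columns(num_docs, num_parent_topics, num_collection_passes):
    """Column count of the cached Theta as observed in this report:
    the real documents plus one block of extra columns per collection pass."""
    return num_docs + num_parent_topics * num_collection_passes


# Assumption: 10 parent topics, 41 documents.
print(expected_theta_columns(41, 10, 1))   # 51
print(expected_theta_columns(41, 10, 90))  # 941
```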

The BigARTM documentation says that Theta is supposed to be number of topics x number of documents.

NikolayPavlychev changed the title from "Problem with shape Child model theta matrix" to "Problem with shape of Child model theta matrix" on Nov 18, 2020
bt2901 (Contributor) commented Nov 18, 2020

This is expected behaviour for hierarchical models. In that case, ARTM adds 'pseudodocuments' to the collection and tries to factorize this extended collection. Each pseudodocument corresponds to one topic of the parent model (the distributions p(subtopic|doc) and p(subtopic|parent_topic) are computationally similar). If you run child_model.get_parent_psi(), the Psi matrix you obtain should consist of exactly the "extra" columns of Theta.
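
As a rough illustration of telling the two kinds of columns apart by label, here is a sketch that assumes (going by the report above) that pseudodocument columns carry the label 0 and real documents never do. `split_theta_columns` is a hypothetical helper, not a BigARTM function, and the labels below are toy data:

```python
def split_theta_columns(column_labels, pseudo_label=0):
    """Return (pseudo_positions, document_positions) for a list of Theta column labels.

    Assumption: every pseudodocument column is labelled `pseudo_label`.
    """
    pseudo_positions, document_positions = [], []
    for position, label in enumerate(column_labels):
        if label == pseudo_label:
            pseudo_positions.append(position)
        else:
            document_positions.append(position)
    return pseudo_positions, document_positions


# Toy header: 3 pseudodocument columns followed by 4 real documents.
labels = [0, 0, 0, 1, 2, 3, 4]
pseudo_positions, doc_positions = split_theta_columns(labels)
print(len(pseudo_positions), len(doc_positions))
```

With a pandas DataFrame, theta.iloc[:, doc_positions] would then select only the real documents.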

Also, for convergence and memory-efficiency reasons, the following pattern is suggested:

model = artm.ARTM(
    ...
    # cache_theta=False,
    theta_columns_naming='title'
)
model.fit_offline(batch_vectorizer, ...)
theta = model.transform(batch_vectorizer)

The result is a pandas DataFrame whose index is the topic names and whose columns are the document names (without pseudodocuments).

You are absolutely correct that this could be made more clear in the documentation.
