
Possible misuse of artm.Dictionary: save/load causes interoperability problems #986

Open
Evgeny-Egorov-Projects opened this issue Sep 17, 2019 · 1 comment

Evgeny-Egorov-Projects commented Sep 17, 2019

While working with the library, I came across the following issue:

if a Dictionary instance is recreated by any means (e.g. loaded from the initial batches or from a saved dictionary) while the original instance is lost or overwritten, score trackers stop returning score values.
Here is minimal code to reproduce the error:

import artm

def init_artm_model(dict_path, modalities_to_use=('@word',)):  # tuple default avoids the mutable-default pitfall
    new_dict = artm.Dictionary(name='some_name')
    new_dict.load(dict_path)

    model = artm.ARTM(
        num_topics=10,
        class_ids=list(modalities_to_use),
    )
    model.scores.add(artm.scores.PerplexityScore(
        name='perplexity_score', dictionary=new_dict,
        class_ids=list(modalities_to_use),
    ))
    model.initialize(new_dict)
    return model

# dict_path and bv come from the setup snippet below
model_artm = init_artm_model(dict_path)
model_artm.fit_offline(bv)
print(model_artm.get_score('perplexity_score'))  # returns NaN

I conjecture that it's related to the following:

import os
import artm

some_data_path = 'your_path_to_batches'
bv = artm.BatchVectorizer(data_format='batches', data_path=some_data_path)
dict_path = os.path.join(some_data_path, 'dict.dict')
test_d = artm.Dictionary(name='some_name')
test_d.gather(some_data_path)
test_d.save(dict_path)
new_dict = artm.Dictionary(name='some_name')
new_dict.load(dict_path)
print('Are dicts identical?', new_dict == test_d)  # prints False
# probably because new_dict._master.master_id != test_d._master.master_id

(note that dictionary.name is the same in all cases)
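For what it's worth, the `False` result above is also what plain Python gives for any two distinct objects of a class that does not define `__eq__`: the default comparison falls back to object identity, even when both wrappers refer to the same underlying data. A minimal sketch (the `DictWrapper` class below is hypothetical, standing in for artm.Dictionary):

```python
class DictWrapper:
    """Hypothetical stand-in for artm.Dictionary: defines no custom __eq__."""
    def __init__(self, name):
        self.name = name

a = DictWrapper(name='some_name')
b = DictWrapper(name='some_name')

print(a == b)  # prints False: default == compares object identity, not contents
print(a == a)  # prints True
```

So even if the saved and loaded dictionaries held identical data, `==` would report them as different unless artm implemented content-based equality.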


Evgeny-Egorov-Projects commented Sep 18, 2019

Following the same line of investigation, I found that initializing the Perplexity score with or without an explicit dictionary leads to inconsistent behaviour. Example:

model_one = artm.ARTM(
    num_processors=1,
    num_topics=5, cache_theta=True,
    num_document_passes=1, dictionary=dictionary,
    scores=[artm.PerplexityScore(name='PerplexityScore')],
)
model_two = artm.ARTM(
    num_processors=1,
    num_topics=5, cache_theta=True,
    num_document_passes=1, dictionary=dictionary,
    scores=[artm.PerplexityScore(name='PerplexityScore', dictionary=dictionary)],
)

When trained on the same data, these two models yield different results. With artm 0.10.0, the last perplexity value is 2.887 for the first model and 3.027 for the second; with 0.9.0 the values are 2.887 for the first model and 0.0 for the second.
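One plausible source of the discrepancy (an assumption on my part, not confirmed against the artm sources) is that perplexity depends on which token set it is normalized over, so scoring against a different dictionary changes the reported value even for the same model. A toy illustration with made-up counts and probabilities:

```python
import math

# Toy document: token counts n_w and a fixed model distribution p_w.
counts = {'cat': 3, 'dog': 2, 'fish': 1}
probs = {'cat': 0.5, 'dog': 0.3, 'fish': 0.2}

def perplexity(counts, probs, vocab):
    """exp(-sum_w n_w * log p_w / N), restricted to tokens present in vocab."""
    n = sum(counts[w] for w in vocab)
    log_lik = sum(counts[w] * math.log(probs[w]) for w in vocab)
    return math.exp(-log_lik / n)

full = perplexity(counts, probs, {'cat', 'dog', 'fish'})
restricted = perplexity(counts, probs, {'cat', 'dog'})
print(full, restricted)  # the two vocabularies give different perplexities
```

If the score silently falls back to a different dictionary (or to none) depending on how it was constructed, this alone could explain the 2.887 vs 3.027 gap.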
