
Possible misuse of artm.Dictionary: save/load causes interoperability problems #986

Open
Evgeny-Egorov-Projects opened this issue Sep 17, 2019 · 1 comment

Evgeny-Egorov-Projects commented Sep 17, 2019

While working with the library, I came across the following issue:

if a Dictionary instance is recreated by any means (e.g. loaded from the initial batches or from a saved dictionary) while the original instance is lost or overwritten, score trackers stop returning score values.
Here is minimal code to reproduce the error:

import artm

def init_artm_model(dict_path, modalities_to_use=('@word',)):  # tuple default avoids the mutable-default pitfall
    new_dict = artm.Dictionary(name='some_name')
    new_dict.load(dict_path)

    model = artm.ARTM(
        num_topics=10,
        class_ids=list(modalities_to_use),
    )
    model.scores.add(artm.scores.PerplexityScore(
        name='perplexity_score', dictionary=new_dict,
        class_ids=list(modalities_to_use),
    ))
    model.initialize(new_dict)
    return model

# dict_path and bv come from the setup snippet below
model_artm = init_artm_model(dict_path)
model_artm.fit_offline(bv)
print(model_artm.get_score('perplexity_score'))  # returns NaN

I conjecture that it's related to the following:

import os
import artm

some_data_path = 'your_path_to_batches'
bv = artm.BatchVectorizer(data_format='batches', data_path=some_data_path)
dict_path = os.path.join(some_data_path, 'dict.dict')
test_d = artm.Dictionary(name='some_name')
test_d.gather(some_data_path)
test_d.save(dict_path)
new_dict = artm.Dictionary(name='some_name')
new_dict.load(dict_path)
print('Are dicts identical?', new_dict == test_d)  # prints False
# probably because new_dict._master.master_id != test_d._master.master_id

(note that dictionary.name is the same in all cases)
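For what it's worth, the `False` result above is also what plain Python gives for any two distinct objects of a class that does not define `__eq__`: the default comparison falls back to object identity, even when both wrappers refer to the same underlying data. A minimal sketch (the `DictWrapper` class below is hypothetical, standing in for artm.Dictionary):

```python
class DictWrapper:
    """Hypothetical stand-in for artm.Dictionary: defines no custom __eq__."""
    def __init__(self, name):
        self.name = name

a = DictWrapper(name='some_name')
b = DictWrapper(name='some_name')

print(a == b)  # prints False: default == compares object identity, not contents
print(a == a)  # prints True
```

So even if the saved and loaded dictionaries held identical data, `==` would report them as different unless artm implemented content-based equality.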


Evgeny-Egorov-Projects commented Sep 18, 2019

Following the same line of investigation, I found that initializing the Perplexity score with or without an explicit dictionary leads to inconsistent behaviour. Example:

model_one = artm.ARTM(
    num_processors=1,
    num_topics=5, cache_theta=True,
    num_document_passes=1, dictionary=dictionary,
    scores=[artm.PerplexityScore(name='PerplexityScore')],
)
model_two = artm.ARTM(
    num_processors=1,
    num_topics=5, cache_theta=True,
    num_document_passes=1, dictionary=dictionary,
    scores=[artm.PerplexityScore(name='PerplexityScore', dictionary=dictionary)],
)

When trained on the same data, these two models yield different results. With artm 0.10.0, the last perplexity value is 2.887 for the first model and 3.027 for the second; with 0.9.0 the values are 2.887 for the first model and 0.0 for the second.
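One plausible source of the discrepancy (an assumption on my part, not confirmed against the artm sources) is that perplexity depends on which token set it is normalized over, so scoring against a different dictionary changes the reported value even for the same model. A toy illustration with made-up counts and probabilities:

```python
import math

# Toy document: token counts n_w and a fixed model distribution p_w.
counts = {'cat': 3, 'dog': 2, 'fish': 1}
probs = {'cat': 0.5, 'dog': 0.3, 'fish': 0.2}

def perplexity(counts, probs, vocab):
    """exp(-sum_w n_w * log p_w / N), restricted to tokens present in vocab."""
    n = sum(counts[w] for w in vocab)
    log_lik = sum(counts[w] * math.log(probs[w]) for w in vocab)
    return math.exp(-log_lik / n)

full = perplexity(counts, probs, {'cat', 'dog', 'fish'})
restricted = perplexity(counts, probs, {'cat', 'dog'})
print(full, restricted)  # the two vocabularies give different perplexities
```

If the score silently falls back to a different dictionary (or to none) depending on how it was constructed, this alone could explain the 2.887 vs 3.027 gap.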
