
Gensim Word2Vec produces different most_similar results through final epoch than end of training #3429

Open
Joshkking opened this issue Jan 13, 2023 · 2 comments


Joshkking commented Jan 13, 2023

Problem description

Word2Vec callbacks produce very different most_similar() results at the last epoch-end callback compared to immediately after training. The expectation is that the final epoch-end callback's similarity results would be identical to, or at least approximate, the post-training most_similar() results.

Steps/code/corpus to reproduce

I'm using gensim's Word2Vec for a recommendation-like task, with part of my evaluation being the use of callbacks and the most_similar() method. However, I am noticing a huge disparity between the final few epoch callbacks and the results obtained immediately after training. In fact, the last epoch callback may often appear worthless, while the post-training result is as good as could be desired.

My during-training tracking of most similar entries utilizes gensim's CallbackAny2Vec class. It follows the doc example fairly directly and roughly looks like:

from gensim.models.callbacks import CallbackAny2Vec

class EpochTracker(CallbackAny2Vec):

  def __init__(self):
    self.epoch = 0

  def on_epoch_begin(self, model):
    print("Epoch #{} start".format(self.epoch))

  def on_epoch_end(self, model):
    print('Some diagnostics')
    # Multiple terms are checked in the real code; one placeholder term shown here
    e = model.wv
    print(e.most_similar(positive=['some term'])[0:3])  # grab the top 3 neighbours for some term

    print("Epoch #{} end".format(self.epoch))
    self.epoch += 1
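For reference, the callback is attached at training time roughly like this (a minimal sketch; gensim's bundled common_texts toy corpus stands in for my real data, and 'some term' in the callback would need to be a token that actually appears in the training vocabulary):

from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # tiny bundled corpus, a stand-in for the real data

model = Word2Vec(
  sentences=common_texts,
  vector_size=100,
  min_count=1,
  epochs=10,
  callbacks=[EpochTracker()],  # epoch-end diagnostics fire during training
)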

As the epochs progress, the most_similar() results given by the callbacks do not indicate progressive learning and appear erratic. In fact, the callback from the first epoch often shows the best result.

Counterintuitively, I also have an additional process (not shown) built into the callback that does indicate gradual learning. Following the similarity print, I take the current model's vectors and evaluate them against a downstream task; in brief, this process is an sklearn GridSearchCV logistic regression check against some known labels.
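For illustration, that downstream check looks roughly like the sketch below (downstream_check and labeled_terms are hypothetical names; the real task is not shown here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

def downstream_check(wv, labeled_terms):
  # labeled_terms: hypothetical dict mapping in-vocabulary terms to known class labels
  terms = [t for t in labeled_terms if t in wv]
  X = np.vstack([wv[t] for t in terms])  # one embedding vector per labeled term
  y = np.array([labeled_terms[t] for t in terms])
  search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=3)
  search.fit(X, y)
  return search.best_score_  # cross-validated accuracy as a rough learning signal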

I find that the last on_epoch_end callback often indicates little learning in my particular use case. However, if I try the similarity call again directly after training the model:

e = e_model.wv # e_model was the variable assignment of the model overall
print(e.most_similar(positive=['some term'])[0:3])

I tend to get excellent results that are in agreement with the downstream evaluation task also used in the callbacks, and that are, at minimum, vastly different from those of the final epoch-end callback.

I suspect most_similar() behaves unusually inside during-training epoch-end callbacks, but I would be happy to learn instead that my approach is flawed.

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.9.7 (default, Sep 16, 2021, 08:50:36)
[Clang 10.0.0]
Bits 64
NumPy 1.21.2
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1

Collaborator

gojomo commented Jan 17, 2023

I believe this is the same as, or related to, #2260 - but the removal of the (at-risk-of-staleness) vectors_norm cache should have cleared that up.

There is still a much smaller cache of each vector's own magnitude, in w2v_model.wv.norms, that might be contributing to the issue. @Joshkking, in the setup where the problem was otherwise showing, what happens if you add a line e.fill_norms(force=True) just before your most_similar() operation? Does that make the last-epoch-end results match the after-return-from-training results?
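In callback terms, that would look roughly like the following (adapting the on_epoch_end sketch above; 'some term' remains the reporter's placeholder):

def on_epoch_end(self, model):
  e = model.wv
  e.fill_norms(force=True)  # recompute cached per-vector norms rather than trusting possibly stale values
  print(e.most_similar(positive=['some term'])[0:3])
  print("Epoch #{} end".format(self.epoch))
  self.epoch += 1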

If there's truly still an issue in Gensim 4.0+, it'd be good to have a small fully self-contained example that vividly demonstrates it. That'd rule out something idiosyncratic in @Joshkking's setup, and likely point to some fix, or new workaround, or new warning we could show.
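As a rough outline of the kind of self-contained check I mean (assuming, say, the text8 corpus from gensim.downloader and a common query word; the exact corpus and term are up to whoever reproduces it):

import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LastEpochProbe(CallbackAny2Vec):
  # record the final epoch-end most_similar() output for later comparison
  def __init__(self, query):
    self.query = query
    self.last_result = None

  def on_epoch_end(self, model):
    self.last_result = model.wv.most_similar(positive=[self.query])[0:3]

corpus = list(api.load("text8"))  # any public tokenized corpus would do
probe = LastEpochProbe("king")
model = Word2Vec(corpus, vector_size=100, epochs=5, callbacks=[probe])

print("last epoch-end :", probe.last_result)
print("after training :", model.wv.most_similar(positive=["king"])[0:3])

If those two printouts differ substantially, that would be the vivid demonstration needed here.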

Author

Joshkking commented Jan 17, 2023

@gojomo The instigating code is work-related and has moved on to an updated environment with that class stripped out. I'll see if I can get the chance to replicate this on a smaller, publicly accessible corpus, though it may be a while.
