
Gensim Word2Vec produces different most_similar results through final epoch than end of training #3429

Open
Joshkking opened this issue Jan 13, 2023 · 2 comments


Joshkking commented Jan 13, 2023

Problem description

Word2Vec callbacks produce very different most_similar() results at the last epoch-end callback compared to immediately after training. The expectation is that the final epoch-end callback's similarity results would be identical to, or at least approximate, the post-training most_similar() results.

Steps/code/corpus to reproduce

I'm using gensim's Word2Vec for a recommendation-like task, with part of my evaluation being the use of callbacks and the most_similar() method. However, I am noticing a huge disparity between the final few epoch callbacks and the results obtained immediately after training. In fact, the last epoch callback may often appear worthless, while the post-training result is as good as could be desired.

My during-training tracking of most similar entries utilizes gensim's CallbackAny2Vec class. It follows the doc example fairly directly and roughly looks like:

from gensim.models.callbacks import CallbackAny2Vec

class EpochTracker(CallbackAny2Vec):

  def __init__(self):
    self.epoch = 0

  def on_epoch_begin(self, model):
    print("Epoch #{} start".format(self.epoch))

  def on_epoch_end(self, model):
    print('Some diagnostics')
    # Multiple terms are checked in the real code; one placeholder term shown here
    e = model.wv
    print(e.most_similar(positive=['some term'])[0:3])  # grab the top 3 neighbours for some term

    print("Epoch #{} end".format(self.epoch))
    self.epoch += 1
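For reference, the callback is attached at training time roughly like this (a minimal sketch; gensim's bundled common_texts toy corpus stands in for my real data, and 'some term' in the callback would need to be a token that actually appears in the training vocabulary):

from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # tiny bundled corpus, a stand-in for the real data

model = Word2Vec(
  sentences=common_texts,
  vector_size=100,
  min_count=1,
  epochs=10,
  callbacks=[EpochTracker()],  # epoch-end diagnostics fire during training
)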

As the epochs progress, the most_similar() results given by the callbacks do not indicate progressive learning and appear erratic. In fact, the callback from the first epoch often shows the best result.

Counterintuitively, I also have an additional process (not shown) built into the callback that does indicate gradual learning. Following the similarity print, I take the current model's vectors and evaluate them against a downstream task; in brief, this process is an sklearn GridSearchCV logistic regression check against some known labels.
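For illustration, that downstream check looks roughly like the sketch below (downstream_check and labeled_terms are hypothetical names; the real task is not shown here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

def downstream_check(wv, labeled_terms):
  # labeled_terms: hypothetical dict mapping in-vocabulary terms to known class labels
  terms = [t for t in labeled_terms if t in wv]
  X = np.vstack([wv[t] for t in terms])  # one embedding vector per labeled term
  y = np.array([labeled_terms[t] for t in terms])
  search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=3)
  search.fit(X, y)
  return search.best_score_  # cross-validated accuracy as a rough learning signal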

I find that the last on_epoch_end callback often indicates little learning in my particular use case. However, if I try the similarity call again directly after training the model:

e = e_model.wv # e_model was the variable assignment of the model overall
print(e.most_similar(positive=['some term'])[0:3])

I tend to get excellent results that are in agreement with the downstream evaluation task also used in the callbacks, and that are, at minimum, vastly different from those of the final epoch-end callback.

I suspect most_similar() behaves unusually inside during-training epoch-end callbacks, but I would be happy to learn instead that my approach is flawed.

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.9.7 (default, Sep 16, 2021, 08:50:36)
[Clang 10.0.0]
Bits 64
NumPy 1.21.2
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1

Collaborator

gojomo commented Jan 17, 2023

I believe this is the same as, or related to, #2260 - but the removal of the (at-risk-of-staleness) vectors_norm cache should have cleared that up.

There is still a much smaller cache of each vector's own magnitude, in w2v_model.wv.norms, that might be contributing to the issue. @Joshkking, in the setup where the problem was otherwise showing, what happens if you add a line e.fill_norms(force=True) just before your most_similar() operation? Does that make the last-epoch-end results match the after-return-from-training results?
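In callback terms, that would look roughly like the following (adapting the on_epoch_end sketch above; 'some term' remains the reporter's placeholder):

def on_epoch_end(self, model):
  e = model.wv
  e.fill_norms(force=True)  # recompute cached per-vector norms rather than trusting possibly stale values
  print(e.most_similar(positive=['some term'])[0:3])
  print("Epoch #{} end".format(self.epoch))
  self.epoch += 1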

If there's truly still an issue in Gensim 4.0+, it'd be good to have a small fully self-contained example that vividly demonstrates it. That'd rule out something idiosyncratic in @Joshkking's setup, and likely point to some fix, or new workaround, or new warning we could show.
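As a rough outline of the kind of self-contained check I mean (assuming, say, the text8 corpus from gensim.downloader and a common query word; the exact corpus and term are up to whoever reproduces it):

import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LastEpochProbe(CallbackAny2Vec):
  # record the final epoch-end most_similar() output for later comparison
  def __init__(self, query):
    self.query = query
    self.last_result = None

  def on_epoch_end(self, model):
    self.last_result = model.wv.most_similar(positive=[self.query])[0:3]

corpus = list(api.load("text8"))  # any public tokenized corpus would do
probe = LastEpochProbe("king")
model = Word2Vec(corpus, vector_size=100, epochs=5, callbacks=[probe])

print("last epoch-end :", probe.last_result)
print("after training :", model.wv.most_similar(positive=["king"])[0:3])

If those two printouts differ substantially, that would be the vivid demonstration needed here.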

Author

Joshkking commented Jan 17, 2023

@gojomo The instigating code is work-related and has moved on to an updated environment with that class stripped out. I'll see if I can get the chance to replicate this on a smaller, publicly accessible corpus, though it may be a while.
