Migrating from Gensim 3.x to 4

Gordon Mohr edited this page Sep 7, 2022 · 50 revisions

Migrating to Gensim 4.0

Gensim 4.0 is, for the most part, backward compatible with older releases (3.8.3 and prior). Your existing stored models and code will continue to work in 4.0, with the following exceptions:

I. No more Python 2

Gensim 4.0+ is Python 3 only. See the Gensim & Compatibility policy page for supported Python 3 versions. The 4.0 models train much faster and consume less RAM (see the 4.0 benchmarks).

II. Word2Vec, FastText, Doc2Vec, KeyedVectors

The *2Vec-related classes (Word2Vec, FastText, & Doc2Vec) have undergone significant internal refactoring for clarity, consistency, efficiency & maintainability.

1. The size constructor parameter is now consistently vector_size everywhere:

model = Word2Vec(size=100, …)  # 🚫
model = FastText(size=100, …)  # 🚫
model = Doc2Vec(size=100, …)  # 🚫

model = Word2Vec(vector_size=100, …)  # 👍
model = FastText(vector_size=100, …)  # 👍
model = Doc2Vec(vector_size=100, …)  # 👍

2. The iter constructor parameter is now consistently epochs everywhere:

model = Word2Vec(iter=5, …)  # 🚫
model = FastText(iter=5, …)  # 🚫
model = Doc2Vec(iter=5, …)  # 🚫

model = Word2Vec(epochs=5, …)  # 👍
model = FastText(epochs=5, …)  # 👍
model = Doc2Vec(epochs=5, …)  # 👍

Previously, the iter name matched the original word2vec implementation. But epochs is more standard and descriptive, and iter clashes with Python's built-in iter().

3. The index2word and index2entity attributes are now index_to_key:

random_word = random.choice(model.wv.index2word)  # 🚫

random_word = random.choice(model.wv.index_to_key)  # 👍

This unifies the terminology: these models map keys to vectors (not just words or entities to vectors).

4. The vocab dict became key_to_index for looking up a key's integer index, with get_vecattr() and set_vecattr() for other per-key attributes:

rock_idx = model.wv.vocab["rock"].index  # 🚫
rock_cnt = model.wv.vocab["rock"].count  # 🚫
vocab_len = len(model.wv.vocab)  # 🚫

rock_idx = model.wv.key_to_index["rock"]   # 👍
rock_cnt = model.wv.get_vecattr("rock", "count")  # 👍
vocab_len = len(model.wv)  # 👍
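Conceptually, the old dict of per-key vocab objects was split into a plain key_to_index mapping plus parallel per-attribute arrays. The following is an illustrative pure-Python sketch of that layout, not Gensim's actual implementation (the class and its attribute-storage dict here are hypothetical stand-ins):

```python
class TinyKeyedVectors:
    """Toy stand-in illustrating Gensim 4's KeyedVectors key storage."""

    def __init__(self, keys):
        self.index_to_key = list(keys)
        self.key_to_index = {k: i for i, k in enumerate(self.index_to_key)}
        self._attrs = {}  # per-key attribute arrays, e.g. "count"

    def __len__(self):
        return len(self.index_to_key)

    def set_vecattr(self, key, attr, value):
        arr = self._attrs.setdefault(attr, [None] * len(self))
        arr[self.key_to_index[key]] = value

    def get_vecattr(self, key, attr):
        return self._attrs[attr][self.key_to_index[key]]


kv = TinyKeyedVectors(["rock", "jazz"])
kv.set_vecattr("rock", "count", 42)
rock_idx = kv.key_to_index["rock"]          # 0
rock_cnt = kv.get_vecattr("rock", "count")  # 42
vocab_len = len(kv)                          # 2
```

Storing attributes in parallel arrays keyed by integer index (rather than one object per word) is what lets Gensim 4 keep large vocabularies in far less memory.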

5. no more init_sims()

L2-normalized vectors are now computed dynamically, on request. The full numpy array of "normalized vectors" is no longer stored in memory:

all_normed_vectors = model.wv.get_normed_vectors()  # still works but now creates a new array on each call!

normed_vector = model.wv.vectors_norm[model.wv.vocab["rock"].index]  # 🚫

normed_vector = model.wv.get_vector("rock", norm=True)  # 👍

This allows Gensim 4.0.0 to be much more memory efficient than Gensim <4.0.

6. no more vocabulary and trainables attributes; properties previously there have been moved back to the model:

out_weights = model.trainables.syn1neg  # 🚫
min_count = model.vocabulary.min_count  # 🚫

out_weights = model.syn1neg  # 👍
min_count = model.min_count  # 👍

7. methods like most_similar(), wmdistance(), doesnt_match(), similarity(), & others moved to KeyedVectors

These methods moved from the full model (Word2Vec, Doc2Vec, FastText) object to its .wv subcomponent (of type KeyedVectors) many releases ago:

w2v_model.most_similar(word)  # 🚫
w2v_model.most_similar_cosmul(word)  # 🚫
w2v_model.wmdistance(wordlistA, wordlistB)  # 🚫
w2v_model.similar_by_word(word)  # 🚫
w2v_model.similar_by_vector(word)  # 🚫
w2v_model.doesnt_match(wordlist)  # 🚫
w2v_model.similarity(wordA, wordB)  # 🚫
w2v_model.n_similarity(wordlistA, wordlistB)  # 🚫
w2v_model.evaluate_word_pairs(wordpairs)  # 🚫
w2v_model.accuracy(questions)  # 🚫
w2v_model.log_accuracy(section)  # 🚫

w2v_model.wv.most_similar(word)  # 👍
w2v_model.wv.most_similar_cosmul(word)  # 👍
w2v_model.wv.wmdistance(wordlistA, wordlistB)  # 👍
w2v_model.wv.similar_by_word(word)  # 👍
w2v_model.wv.similar_by_vector(word)  # 👍
w2v_model.wv.doesnt_match(wordlist)  # 👍
w2v_model.wv.similarity(wordA, wordB)  # 👍
w2v_model.wv.n_similarity(wordlistA, wordlistB)  # 👍
w2v_model.wv.evaluate_word_pairs(wordpairs)  # 👍
w2v_model.wv.evaluate_word_analogies(questions)  # 👍 (replaces accuracy())
w2v_model.wv.log_accuracy(section)  # 👍

Most generally, if a call on a full model (Word2Vec, Doc2Vec, FastText) object only needs the word vectors to calculate its response, and you encounter an AttributeError in Gensim 4.0.0+, make the call on the contained KeyedVectors object instead.

In addition, wmdistance now normalizes vectors to unit length by default:

# 🚫 BEFORE
model.init_sims(replace=True)  # 🚫 First normalize all embedding vectors.
distance = model.wmdistance(wordlistA, wordlistB)  # 🚫 Then compute WMD distance.

# 👍 Now in 4.0+
distance = model.wv.wmdistance(wordlistA, wordlistB)  # 👍 WMD distance over normalized embedding vectors.
distance = model.wv.wmdistance(wordlistA, wordlistB, norm=False)  # 👍 WMD over non-normalized vectors.

8. Removed on_batch_begin and on_batch_end callbacks

These two training callbacks had muddled semantics, confused users, and introduced race conditions. Use on_epoch_begin and on_epoch_end instead.

Gensim 4.0 now ignores these two functions entirely, even if implementations for them are present.
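An epoch-level callback can usually replace a batch-level one. A minimal sketch, assuming Gensim's CallbackAny2Vec-style interface (in real use you would subclass gensim.models.callbacks.CallbackAny2Vec; the class name here is hypothetical):

```python
class EpochLogger:
    """Sketch of an epoch-level training callback.

    Only on_epoch_begin / on_epoch_end fire in Gensim 4.0+;
    any on_batch_* methods are silently ignored.
    """

    def __init__(self):
        self.epochs_seen = 0

    def on_epoch_begin(self, model):
        print(f"starting epoch {self.epochs_seen}")

    def on_epoch_end(self, model):
        self.epochs_seen += 1


logger = EpochLogger()
# In real use, something like: Word2Vec(corpus, epochs=5, callbacks=[logger])
```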

Additional Doc2Vec-specific changes

9. Doc2Vec.docvecs attribute is now Doc2Vec.dv

…and it's now a standard KeyedVectors object, so it has all the standard attributes and methods of KeyedVectors (but no longer any specialized properties like vectors_docs):

random_doc_id = np.random.randint(doc2vec_model.docvecs.count)  # 🚫
document_vector = doc2vec_model.docvecs["some_document_tag"]  # 🚫
all_docvecs = doc2vec_model.docvecs.vectors_docs  # 🚫

random_doc_id = np.random.randint(len(doc2vec_model.dv))  # 👍
document_vector = doc2vec_model.dv["some_document_tag"]  # 👍
all_docvecs = doc2vec_model.dv.vectors  # 👍

Because the vectors for document tags are now in a standard KeyedVectors, prior Doc2Vec-specific accessors like doctag_syn0, vectors_docs, or index_to_doctag are no longer supported; use the analogous generic accessors instead:

all_docvecs = doc2vec_model.docvecs.doctag_syn0  # 🚫
all_docvecs = doc2vec_model.docvecs.vectors_docs  # 🚫 
doctag = doc2vec_model.docvecs.index_to_doctag[n]  # 🚫 

all_docvecs = doc2vec_model.dv.vectors  # 👍 
doctag = doc2vec_model.dv.index_to_key[n]  # 👍

Additional FastText-specific changes

10. check if a word is fully "OOV" (out of vocabulary) for FastText:

"night" in model.wv.vocab  # 🚫

"night" in model.wv.key_to_index  # 👍

Of course, even OOV words have vectors in FastText (assembled from vectors of their character ngrams), so the following is not a good way to test the presence of a vector:

"no_such_word" in model.wv  # 🚫 always returns True for FastText!
model.wv["no_such_word"]  # returns a vector even for OOV words
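The reason OOV words still get vectors is that FastText assembles them from the vectors of the word's character n-grams. A toy illustration of the n-gram extraction (the real implementation additionally hashes each n-gram into a fixed number of buckets, which this sketch skips; the function name and defaults are illustrative):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Extract FastText-style character n-grams from a word."""
    w = f"<{word}>"  # FastText pads words with angle brackets
    return [w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]


# Even a word absent from training yields n-grams, so FastText can
# always assemble *some* vector for it:
print(char_ngrams("night", min_n=3, max_n=4))
```

That is why `"no_such_word" in model.wv` is always True for FastText: membership should be tested against key_to_index, as shown above.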

Other advanced notes

The following notes are for advanced users who were using or extending Gensim internals more deeply, perhaps relying on protected/private attributes.

  • A key change is the creation of a unified KeyedVectors class for working with sets of vectors, reused for both word vectors and doc vectors, whether they are a subcomponent of a full algorithm model (for training) or a standalone vector set (for lighter-weight reuse). The unified class thus shares the same (and often improved) convenience methods and implementations.

  • One notable internal implementation change: the usual similarity operations no longer require building a second full cache of unit-normalized vectors via the .init_sims() method, stored in the .vectors_norm property. That cache used to cause a noticeable delay on first use, much higher memory use, and extra complications when deploying or sharing vectors among multiple processes.

  • A number of errors and inefficiencies in the FastText implementation have been corrected. Model size in memory and when saved to disk will be much smaller, and using FastText as if it were Word2Vec, by disabling character n-grams (with max_n=0), should be as fast & performant as vanilla Word2Vec.

  • When supplying a Python iterable corpus to instance-initialization, build_vocab(), or train(), the parameter name is now corpus_iterable, to reflect the central expectation (that it is an iterable) and for correspondence with the corpus_file alternative. The prior model-specific names for this parameter, like sentences or documents, were overly-specific, and sometimes led users to the mistaken belief that such input must be precisely natural-language sentences.

If you're unsure or getting unexpected results, let us know at the Gensim mailing list.

III. Phraser

11. Phraser class is now called FrozenPhrases

…to be more explicit in its intent, and easier to tell apart from its chunkier parent Phrases:

phrases = Phrases(corpus)
phraser = Phraser(phrases)  # 🚫

phrases = Phrases(corpus)
frozen_phrases = phrases.freeze()  # 👍

Note that phrases (collocation detection, multi-word expressions) have been pretty much rewritten from scratch for Gensim 4.0, and are more efficient and flexible now overall.

IV. Removal of deprecations and unmaintained modules

12. Removed gensim.summarization

Despite its general-sounding name, the module was unlikely to satisfy most production use cases and tended to waste people's time. See this GitHub ticket for the motivation behind the removal.

13. Removed "TFIDF pivoted normalization"

A rarely used contributed module with poor code and documentation quality.

14. Renamed similarities.index to similarities.annoy

The original module was named too broadly. Now it's clearer this module employs the Annoy kNN library, while there's also similarities.nmslib etc.

15. Removed third party wrappers

These wrappers of third-party libraries required too much upkeep, and there were no volunteers to maintain and support them properly in Gensim.

If your work depends on any of the modules below, feel free to copy it out of Gensim 3.8.3 (the last release where they appear), and extend & maintain the wrapper yourself.

The removed submodules are:

- gensim.models.wrappers.dtmmodel
- gensim.models.wrappers.ldamallet
- gensim.models.wrappers.ldavowpalwabbit
- gensim.models.wrappers.varembed
- gensim.models.wrappers.wordrank
- gensim.sklearn_api.atmodel
- gensim.sklearn_api.d2vmodel
- gensim.sklearn_api.ftmodel
- gensim.sklearn_api.hdp
- gensim.sklearn_api.ldamodel
- gensim.sklearn_api.ldaseqmodel
- gensim.sklearn_api.lsimodel
- gensim.sklearn_api.phrases
- gensim.sklearn_api.rpmodel
- gensim.sklearn_api.text2bow
- gensim.sklearn_api.tfidf
- gensim.sklearn_api.w2vmodel
- gensim.viz