
Train vs Inference methods #107

Open
priamai opened this issue Dec 13, 2023 · 6 comments

priamai commented Dec 13, 2023

Hello there,
what is the correct way to separate training from inference?

Is this correct? I run the training first and save the embeddings, then I load a new graph and query for the most similar nodes?


    args = parser.parse_args()

    if args.method=="train":

        EMBEDDING_FILENAME="word2vec.emb"
        EMBEDDING_MODEL_FILENAME="word2vec.model"

        # Precompute probabilities and generate walks
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200,workers=4)  # Use temp_folder for big graphs

        # Embed nodes
        model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

        # Save embeddings for later use
        model.wv.save_word2vec_format(EMBEDDING_FILENAME)

        # Save model for later use
        model.save(EMBEDDING_MODEL_FILENAME)

    if args.method == "test":
        # now load
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200,workers=4)  # Use temp_folder for big graphs

        model = node2vec.fit(window=10, min_count=1, batch_words=4)
        model.wv.load_word2vec_format(EMBEDDING_FILENAME)
        model.load(EMBEDDING_MODEL_FILENAME)

        # do some checks

        # Look for most similar nodes
        sim_nodes = model.wv.most_similar('alert--440375ba-c4af-4964-be1e-c6f9906416ff')  # Output node names are always strings

        for node, _ in sim_nodes:
            print(node)

eliorc commented Dec 14, 2023

No, I wouldn't go this way.

Training is okay, but for testing you do not need Node2Vec. The algorithm outputs embeddings in a known format; once you're done creating them, you don't need the algorithm again.

So just use

from gensim.models import KeyedVectors

space = KeyedVectors.load_word2vec_format(EMBEDDING_FILENAME)

then, to look up vectors, see the gensim docs.
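
For example, a minimal lookup sketch (using the node ID from your earlier snippet, assuming it exists in the saved embedding file):

    from gensim.models import KeyedVectors

    space = KeyedVectors.load_word2vec_format(EMBEDDING_FILENAME)

    # Vector for a single node: keys are the node names as strings
    vec = space['alert--440375ba-c4af-4964-be1e-c6f9906416ff']

    # Nearest neighbours in embedding space
    for node, score in space.most_similar('alert--440375ba-c4af-4964-be1e-c6f9906416ff', topn=5):
        print(node, score)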

priamai commented Dec 14, 2023

Thanks for the reference. Following your suggestion, is this a valid approach?
Does it make sense to save both the word-vector file and the model file, or should I keep only the model file?
Why do the edges fail to load (see the last line) with an error?

    NODE_WORD_FILENAME = "word2vec.emb"
    NODE_MODEL_FILENAME = "word2vec.model"
    EDGES_WORD_FILENAME = "edges2vec.emb"

    if args.method=="train":

        # Precompute probabilities and generate walks
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200,workers=4)  # Use temp_folder for big graphs

        # Embed nodes
        model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

        # Save embeddings for later use
        model.wv.save_word2vec_format(NODE_WORD_FILENAME)

        # Save model for later use
        model.save(NODE_MODEL_FILENAME)

        edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

        # Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
        edges_kv = edges_embs.as_keyed_vectors()

        # Save embeddings for later use
        edges_kv.save_word2vec_format(EDGES_WORD_FILENAME)


    if args.method == "test":
        import re

        model = Word2Vec.load(NODE_MODEL_FILENAME)
        # this generates an error: could not convert string to float
        edges_kv = KeyedVectors.load_word2vec_format(EDGES_WORD_FILENAME)

priamai commented Dec 14, 2023

Last error:


  File "/home/robomotic/DevOps/gitlab/ava-prod-ai/venv/lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 1980, in <listcomp>
    word, weights = parts[0], [datatype(x) for x in parts[1:]]

eliorc commented Dec 17, 2023

Which line fails? The edges_kv = line or the model = line?

priamai commented Dec 17, 2023

Yes, it is the keyed vectors line:

this generates an error: could not convert string to float

    edges_kv = KeyedVectors.load_word2vec_format(EDGES_WORD_FILENAME)

eliorc commented Dec 20, 2023

I can see why this happens: these are edge embeddings, so the keys are stringified node tuples, which contain spaces, and the plain-text word2vec format splits on whitespace when loading (hence the "could not convert string to float").

If you want to use edge embeddings, why not do it this way:

node_embeddings = KeyedVectors.load_word2vec_format(NODE_WORD_FILENAME)
edges_embs = HadamardEmbedder(keyed_vectors=node_embeddings)

# Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
edges_kv = edges_embs.as_keyed_vectors()
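
Then lookups can go through the embedder or the keyed vectors (a minimal sketch; 'node-a' and 'node-b' are hypothetical node names, substitute two real nodes of G; if I recall the README correctly, the keys in edges_kv are the sorted edge tuples rendered as str):

    # Embedding for a single edge, indexed with a tuple of node names
    edge_vec = edges_embs[('node-a', 'node-b')]

    # Most similar edges: keys in edges_kv are the stringified, sorted tuples
    for edge, score in edges_kv.most_similar(str(('node-a', 'node-b')), topn=5):
        print(edge, score)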
