
Train vs Inference methods #107

Open
priamai opened this issue Dec 13, 2023 · 6 comments

priamai commented Dec 13, 2023

Hello there,
what is the correct way to separate training from inference?

Is this correct? I run the training first and save the embeddings, then I load a new graph and query for the most similar nodes?


    args = parser.parse_args()

    if args.method=="train":

        EMBEDDING_FILENAME="word2vec.emb"
        EMBEDDING_MODEL_FILENAME="word2vec.model"

        # Precompute probabilities and generate walks
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200,workers=4)  # Use temp_folder for big graphs

        # Embed nodes
        model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

        # Save embeddings for later use
        model.wv.save_word2vec_format(EMBEDDING_FILENAME)

        # Save model for later use
        model.save(EMBEDDING_MODEL_FILENAME)

    if args.method == "test":
        # now load
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200,workers=4)  # Use temp_folder for big graphs

        model = node2vec.fit(window=10, min_count=1, batch_words=4)
        model.wv.load_word2vec_format(EMBEDDING_FILENAME)
        model.load(EMBEDDING_MODEL_FILENAME)

        # do some checks

        # Look for most similar nodes
        sim_nodes = model.wv.most_similar('alert--440375ba-c4af-4964-be1e-c6f9906416ff')  # Output node names are always strings

        for node, _ in sim_nodes:
            print(node)

eliorc commented Dec 14, 2023

No, I wouldn't go this way.

Training is okay, but for testing you do not need Node2Vec. The algorithm outputs embeddings in a known format; once you're done creating them, you don't need the algorithm again.

So just use

from gensim.models import KeyedVectors

space = KeyedVectors.load_word2vec_format(EMBEDDING_FILENAME)

then, to look up vectors, see the gensim docs.
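
For example, a minimal lookup sketch (using the node ID from your earlier snippet, assuming it exists in the saved embedding file):

    from gensim.models import KeyedVectors

    space = KeyedVectors.load_word2vec_format(EMBEDDING_FILENAME)

    # Vector for a single node: keys are the node names as strings
    vec = space['alert--440375ba-c4af-4964-be1e-c6f9906416ff']

    # Nearest neighbours in embedding space
    for node, score in space.most_similar('alert--440375ba-c4af-4964-be1e-c6f9906416ff', topn=5):
        print(node, score)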

priamai commented Dec 14, 2023

Thanks for the reference. Following your suggestion, is this a valid approach?
Does it make sense to save both the word-vector file and the model file, or should I keep only the model file?
Why do the edges fail to load (see the last line) with an error?

    NODE_WORD_FILENAME = "word2vec.emb"
    NODE_MODEL_FILENAME = "word2vec.model"
    EDGES_WORD_FILENAME = "edges2vec.emb"

    if args.method=="train":

        # Precompute probabilities and generate walks
        node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200,workers=4)  # Use temp_folder for big graphs

        # Embed nodes
        model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

        # Save embeddings for later use
        model.wv.save_word2vec_format(NODE_WORD_FILENAME)

        # Save model for later use
        model.save(NODE_MODEL_FILENAME)

        edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

        # Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
        edges_kv = edges_embs.as_keyed_vectors()

        # Save embeddings for later use
        edges_kv.save_word2vec_format(EDGES_WORD_FILENAME)


    if args.method == "test":
        import re

        model = Word2Vec.load(NODE_MODEL_FILENAME)
        # this generates an error: could not convert string to float
        edges_kv = KeyedVectors.load_word2vec_format(EDGES_WORD_FILENAME)

priamai commented Dec 14, 2023

Last error:


  File "/home/robomotic/DevOps/gitlab/ava-prod-ai/venv/lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 1980, in <listcomp>
    word, weights = parts[0], [datatype(x) for x in parts[1:]]

eliorc commented Dec 17, 2023

Which line fails? The edges_kv = line or the model = line?

priamai commented Dec 17, 2023

Yes, it is the keyed vectors line:

this generates an error: could not convert string to float

    edges_kv = KeyedVectors.load_word2vec_format(EDGES_WORD_FILENAME)

eliorc commented Dec 20, 2023

I can see why this happens: these are edge embeddings, so the keys are stringified node tuples, which contain spaces, and the plain-text word2vec format splits on whitespace when loading (hence the "could not convert string to float").

If you want to use edge embeddings, why not do it this way:

node_embeddings = KeyedVectors.load_word2vec_format(NODE_WORD_FILENAME)
edges_embs = HadamardEmbedder(keyed_vectors=node_embeddings)

# Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
edges_kv = edges_embs.as_keyed_vectors()
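
Then lookups can go through the embedder or the keyed vectors (a minimal sketch; 'node-a' and 'node-b' are hypothetical node names, substitute two real nodes of G; if I recall the README correctly, the keys in edges_kv are the sorted edge tuples rendered as str):

    # Embedding for a single edge, indexed with a tuple of node names
    edge_vec = edges_embs[('node-a', 'node-b')]

    # Most similar edges: keys in edges_kv are the stringified, sorted tuples
    for edge, score in edges_kv.most_similar(str(('node-a', 'node-b')), topn=5):
        print(edge, score)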
