Calculating embeddings for new nodes after training #62

Open
judas123 opened this issue Jul 8, 2022 · 2 comments
judas123 commented Jul 8, 2022

I am trying to run Cleora on a simple dataset. My TSV file follows a simple "lead attribute" format:

l1 <\t> a1
l2 <\t> a1
l1 <\t> a2
l3 <\t> a2

Leads are connected to some attributes.

I have a Set A which is used to train embeddings for all nodes (leads and attributes) in that set.

For the new nodes in Set B, which follow the same "lead attribute" format, I calculate embeddings using the following two methods. In both cases I train an XGBoost model on the embeddings of the "lead" nodes of Set A and predict on the "lead" nodes of Set B to calculate the AUC.
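
For reference, the evaluation step looks roughly like this (a minimal sketch; the random arrays are only placeholders for the actual Cleora lead embeddings and my labels):

```python
# Minimal sketch of the evaluation step: train XGBoost on Set A lead
# embeddings, score Set B lead embeddings, report AUC.
# Random arrays stand in for the real Cleora embeddings and binary labels.
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

dim = 128                                    # embedding dimension (example value)
emb_a = np.random.rand(1000, dim)            # Set A lead embeddings
labels_a = np.random.randint(0, 2, 1000)     # Set A lead labels
emb_b = np.random.rand(200, dim)             # Set B lead embeddings (from Method 1 or 2)
labels_b = np.random.randint(0, 2, 200)      # Set B lead labels

model = xgb.XGBClassifier(n_estimators=200)  # trained on Set A leads only
model.fit(emb_a, labels_a)
scores = model.predict_proba(emb_b)[:, 1]    # predicted probability for each Set B lead
print("AUC:", roc_auc_score(labels_b, scores))
```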

Method 1

I jointly train embeddings on the combined Set A and Set B, which directly gives me embeddings for all "lead" nodes. On Set B, the XGBoost model (trained on the "lead" embeddings of Set A) reaches an AUC of ~0.8.

Method 2

I use the approach suggested in the closed issue #21: I train the embeddings only on Set A. Then, for each "lead" node of Set B, I extract the embeddings of all attributes that lead is connected to, average them, and apply L2 normalization. With the XGBoost model trained on the Set A "lead" embeddings, I then predict on these Set B "lead" embeddings. The AUC drops to ~0.65.
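
Concretely, the per-lead computation in Method 2 looks roughly like this (a sketch; `attr_emb` and `lead_attrs` are placeholders for my own data structures, not Cleora outputs):

```python
# Sketch of Method 2 for one Set B lead: average the Set A embeddings of the
# attributes the lead is connected to, then L2-normalize the result.
import numpy as np

def embed_new_lead(lead_attrs, attr_emb):
    # attr_emb: dict mapping attribute id -> Set A embedding (numpy array)
    # lead_attrs: list of attribute ids connected to one Set B lead
    vecs = [attr_emb[a] for a in lead_attrs if a in attr_emb]  # drop attributes unseen in Set A
    if not vecs:
        return None                                            # lead has no attribute known from Set A
    mean = np.mean(vecs, axis=0)                               # average the attribute embeddings
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean                   # L2 normalization

# Toy example
attr_emb = {"a1": np.array([1.0, 0.0]), "a2": np.array([0.0, 1.0])}
print(embed_new_lead(["a1", "a2"], attr_emb))                  # -> [0.70710678 0.70710678]
```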

Is there any reason why the AUC drops with Method 2, which was suggested for calculating embeddings for incoming nodes on the fly? The alternative is Method 1, where I have to retrain on the whole graph every time new nodes are added.

Thanks

@barbara3430

Dear @judas123
Some drop in embedding quality is expected in the 'averaging' scenario, but your drop is large, so there are a few things to check. Is your Set B large? Perhaps some attributes appear only in Set B and are not represented in Set A, so no meaningful embeddings have been computed for them. This can happen when the sets are split on a temporal basis and the data drifts over time: many new attributes are created and old ones are discarded. Another thing to consider: maybe your Set A is markedly different from Set B in the underlying logic of the data, e.g. in Set A the 'leads' are children's toys while Set B contains clothing items. Our "node reconstruction" scenario assumes that the graph chunks share a common logic, which can be carried over from the base graph to the "new" graph connections.
Also note that by averaging the embeddings you are effectively conducting one extra iteration of Cleora, so you may be going a step too far. You could therefore try embedding Set A with best_iteration-1 iterations and do the averaging on those embeddings (see the sketch below). Your optimal iteration number may also be altogether different when training on Set A only, due to pronounced differences in the graph. I would also check the performance when training both the embeddings AND the model on Set A, to see whether the Set A embeddings are well trained in the first place.
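
To illustrate the point about the extra iteration (a simplified sketch of the idea, not the actual Cleora implementation): one propagation step multiplies the embedding matrix by a row-stochastic transition matrix and then L2-normalizes the rows, which is the same averaging-plus-normalization you apply to the new leads.

```python
# Simplified sketch: one Cleora-style propagation step. Each node's new vector
# is the (uniformly weighted) mean of its neighbours' vectors, L2-normalized.
# Averaging a new lead's attribute embeddings and L2-normalizing applies this
# same operation one extra time, after training has finished.
import numpy as np

def propagation_step(emb, transition):
    # emb: (n_nodes, dim) embedding matrix
    # transition: (n_nodes, n_nodes) row-stochastic matrix over neighbours
    out = transition @ emb
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```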
Generally speaking, if your Method 1 works a lot better than Method 2, it makes sense to periodically recompute the embeddings for the whole graph. Cleora is designed to be efficient enough that this full recompute can usually be done very often. In fact, this is what we do at our company: we simply retrain on the graph regularly to ensure the best possible performance.

Hope this helps!
Barbara

@judas123 (Author)

@barbara3430 Thanks for the detailed answer.

  1. Set A, on which I build and train the embeddings, has around 330K lead nodes and, together with the attributes, around 2.6 million edges. Set B has 80K lead nodes connected to those attributes.
  2. There are no new attributes in Set B; its attributes are a subset of those in Set A.
  3. I will try the suggested approach of training with best_iteration - 1 iterations and then averaging the embeddings.
  4. If that does not work, I will probably go with Method 1, as you suggested.

Thanks
