
A problem in eval.py #5

Closed
zhangzy6 opened this issue Jan 26, 2019 · 11 comments

@zhangzy6

I found a problem in eval.py: when the scores of the entities are the same, the ranks of all the entities are 1. Can you explain this?

@daiquocnguyen
Owner

It is a simple intuition that triples having the same score should have the same rank.

@timlacroix

timlacroix commented Jun 12, 2019

Hi, I believe this is wrong, @daiquocnguyen.
Following your current evaluation, a trivial model giving a score of 0 to all triples would have a perfect MRR of 1.

Two simple fixes: you can either be optimistic and use the average rank of all ties (scipy.stats.rankdata with method='average'), or be pessimistic and use the worst rank of all ties (method='max'), as is done, for example, here: https://github.com/facebookresearch/kbc
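For illustration, a minimal sketch (not taken from eval.py) of how the different rankdata tie-breaking methods treat a candidate list where every triple gets the same score:

```python
# Minimal illustration (not from the repository): how scipy.stats.rankdata
# treats tied scores under the methods discussed in this thread.
import numpy as np
from scipy.stats import rankdata

scores = np.zeros(4)  # a trivial model: every candidate triple scores 0

print(rankdata(scores, method='min'))      # [1. 1. 1. 1.]      every tie "ranks first"
print(rankdata(scores, method='ordinal'))  # [1 2 3 4]          ties broken by list position
print(rankdata(scores, method='average'))  # [2.5 2.5 2.5 2.5]  ties share their average rank
print(rankdata(scores, method='max'))      # [4. 4. 4. 4.]      ties get the worst rank
```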

Currently your results for ConvKB and CapsE are not comparable to the remainder of the literature and do not make sense (since a trivial model would be optimal).
Re-running your evaluation with method='max' gives an MRR of 0.21 on FB15k-237.

@daiquocnguyen
Owner

daiquocnguyen commented Jun 12, 2019

At the beginning, in order to work with a fixed batch size when evaluating ConvKB for knowledge graph completion, I replicated each correct test triple several times and added the copies to its set of corrupted triples to fill out a batch. That is the reason why I said that triples having the same score should have the same rank: the triples I mean are the correct test triple and its replicated copies.
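For illustration, a hypothetical sketch of that padding scheme (the function name and shapes are assumptions for this comment, not the repository's code):

```python
# Hypothetical sketch of the padding scheme described above (not the repo's code):
# the set of corrupted triples is padded with copies of the correct test triple
# so that every evaluation batch reaches the same fixed size.
import numpy as np

def pad_to_batch(correct_triple, corrupted_triples, batch_size):
    """Pad the candidate list with copies of the correct triple."""
    candidates = [correct_triple] + list(corrupted_triples)
    while len(candidates) < batch_size:
        candidates.append(correct_triple)  # replicated copies share one score,
    return np.array(candidates)            # hence "same score, same rank"
```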

Last year, I found out that someone mentioned on OpenReview (ICLR 2019) that some different triples also receive the same score on FB15k-237.

I don't know how many triples they were referring to, but this probably does not happen on WN18RR, WN11, FB13 and SEARCH17.
I believe my model implementation itself and its evaluation do not have any problems.
And I think FB15k-237 is quite a special dataset, in that TransE outperforms DistMult, ComplEx and even ConvE (as of its v1 from July 2017, which we worked with).

@timlacroix

timlacroix commented Jun 13, 2019

Hi,

"At the begging, in order to work with a batch size when evaluating ConvKB for the knowledge graph completion, I replicated each correct test triple several times to add to its set of corrupted triples to fulfill a batch size. That's reason why I said that triples having a same score should have same rank. Those triples I mean are the correct test triple and its replicated triples."

No. Each triple in the list new_x_batch is unique: there are exactly n_entities elements in new_x_batch, each with a unique head or tail. You then "filter" all correct triples out on lines 195-202 before re-adding the correct triple (which was removed in that step) on lines 205-206.
Since there are no duplicated triples, there is no reason why two different triples with the same score should share the same rank.

Currently, the model that performs best under your evaluation is one that gives a score of 0 to all triples.
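A minimal sketch of that failure mode (the position of the correct triple and the methods named here are illustrative assumptions, not a transcript of eval.py):

```python
# Illustrative only: a constant model ties every candidate. If the correct triple
# ends up first in the candidate list, position-based tie-breaking ranks it 1 and
# its reciprocal-rank contribution is a perfect 1.0, whereas method='max' reports
# the pessimistic worst case.
import numpy as np
from scipy.stats import rankdata

n_candidates = 10
scores = np.zeros(n_candidates)  # constant model: every candidate scores 0
correct_idx = 0                  # assumption: the correct triple sits first

rank_optimistic = rankdata(scores, method='ordinal')[correct_idx]  # 1
rank_pessimistic = rankdata(scores, method='max')[correct_idx]     # 10

print(1.0 / rank_optimistic, 1.0 / rank_pessimistic)  # 1.0 vs 0.1
```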

Your numbers for ComplEx are outdated; please see https://github.com/facebookresearch/kbc for up-to-date numbers.

@daiquocnguyen
Owner

By "at the beginning" I mean a year and a half ago: when using a fixed batch size, I replicated each correct test triple several times and added the copies to its set of corrupted triples to fill out a batch. That is what I was referring to in my previous comment above. After the discussion on OpenReview last year, I made a new implementation that does not use a batch size for the evaluation, as you now see here.

@chenwang1701

chenwang1701 commented Dec 20, 2019

So is the problem already solved? Could we use eval.py to evaluate methods derived from ConvKB?

@daiquocnguyen
Owner

daiquocnguyen commented May 25, 2020

I would like to clarify that there is no problem in eval.py itself w.r.t. the implementation of the "test_prediction" function. You can integrate the test_prediction function from eval.py into your code.

As the issue doesn't appear on the other datasets, I still don't know what the exact answer is.

@timlacroix

@chenwang1701, if you use eval.py from this repo, you'll suffer from the same problems mentioned in this issue. However, the fix to obtain results that are comparable to the state of the art is simple: on line 222 of eval.py, just use method='average', as recommended here: https://arxiv.org/pdf/1911.03903.pdf
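A sketch of that one-line change (the dummy `results` array is a placeholder; only the rankdata call reflects the suggestion):

```python
# Sketch of the suggested fix at line 222 of eval.py; the surrounding code is not
# reproduced here, only the rankdata call.
import numpy as np
from scipy.stats import rankdata

results = np.random.rand(100)  # placeholder for the scores of all candidate triples

# before: results_with_id = rankdata(results, method='ordinal')
results_with_id = rankdata(results, method='average')  # ties now share their average rank
```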

@daiquocnguyen
Owner

daiquocnguyen commented May 27, 2020

@timlacroix @chenwang1701 @zhangzy6 @AndRossi The paper above was accepted to ACL 2020 and advertised via Twitter, which is how I learned of it. I see that the paper contributes some findings on the evaluation protocol. But it is not fair that the paper does not mention that the issue appears "only" on FB15k-237, for which I still don't know the exact answer. And it does not mean that ConvKB cannot serve as a good baseline at this time.

I have found the ACL 2019 paper (and its PyTorch code) where the authors used ConvKB as the decoder. The results are still good for our ConvKB, and I can get better results with a suitable initialization.

So, technically speaking, a reasonable answer comes from PyTorch (instead of TensorFlow), where ConvKB does not run into the issue on FB15k-237. I plan to reimplement our ConvKB in PyTorch to answer you further.

daiquocnguyen reopened this May 27, 2020
@daiquocnguyen
Owner

daiquocnguyen commented May 27, 2020

@chenwang1701 I believe that the issue does not come from the implementation of eval.py itself. Theoretically speaking, using results_with_id = rankdata(results, method='ordinal') is equivalent to using results_with_id = rankdata(results, method='average'), because different triples will have different scores under an existing KG embedding model.
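As a small check of that claim (illustrative data only, not from the repository), the two methods do agree whenever all scores are distinct:

```python
# Illustrative check: with distinct scores there are no ties, so 'ordinal' and
# 'average' produce identical ranks; the choice only matters when scores tie.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
scores = rng.permutation(1000).astype(float)  # distinct scores in random order

assert np.array_equal(rankdata(scores, method='ordinal'),
                      rankdata(scores, method='average'))
```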

In the special case when you have to use a fixed batch size, and you replicate each test triple several times (adding the copies to its set of corrupted triples) to fill out a batch, using method='ordinal' is the right choice because each test triple and its replicated identical copies must have the same rank. That was what I did three years ago.

The issue mentioned here is due to the finding that ConvKB returns the same score for different triples, and only on FB15k-237. Theoretically speaking, this should not happen. I gave an answer in the comment above.

@daiquocnguyen
Owner

daiquocnguyen commented Jun 12, 2020

@timlacroix @zhangzy6 @AndRossi I have just released the PyTorch implementation of our ConvKB, based on the OpenKE framework, to deal with the issue and to show you that the ACL 2020 paper is not fair in its re-evaluation of our model. The authors just changed a line of code in eval.py to support their claims and did not give a reasonable solution for our ConvKB model on FB15k-237.

The results obtained by the PyTorch implementation using the existing initialization are as follows:

| Dataset  | MR   | MRR   | Hits@10 |
|----------|------|-------|---------|
| WN18RR   | 2741 | 0.220 | 50.83   |
| FB15K237 | 196  | 0.302 | 48.31   |

These results were state-of-the-art in 2017.

Note that ConvKB is used as a decoder to improve task performance. You can get better results if you use the initialization produced by TransE from the OpenKE framework.
