cannot reproduce leaderboard result #16

Open
hsl89 opened this issue Jan 29, 2023 · 1 comment
Comments


hsl89 commented Jan 29, 2023

Hello Niklas,
I have a question regarding reproducing SGPT's results. On the MTEB leaderboard, the 125M-weightedmean-msmarco-specb-bitfit model achieves 12.21 NDCG@10 on SCIDOCS. However, I wasn't able to reproduce that result following the instructions here. In my benchmarking I got a very low number (0.00085). I think the instructions are a bit off.

My second question is that I couldn't really understand the idea behind this block. Looking at how you tokenize queries and corpus, it seems much more natural to me to simply wrap query text in [ ] and corpus text in { } before tokenizing them. I got an NDCG@10 of 11.09 when preprocessing SCIDOCS this way (see the sketch below), which is much closer to the number reported on the leaderboard.
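
For concreteness, here is a minimal sketch of that bracket-wrapping variant. The model id is taken from the leaderboard entry; the helper itself is illustrative and not the script from the repo:

```python
# Sketch of the naive variant: wrap the raw text in brackets, then tokenize.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit"
)

def encode_naive(text: str, is_query: bool):
    # Queries get [ ], corpus documents get { }, added as plain characters.
    wrapped = "[" + text + "]" if is_query else "{" + text + "}"
    return tokenizer(wrapped, return_tensors="pt")

# The opening bracket can be merged with the first word by the BPE tokenizer,
# e.g. "[This is a sentence]" may come out as ["[This", " is", " a", ...].
print(tokenizer.tokenize("[This is a sentence]"))
```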

@Muennighoff (Owner) commented

> Hello Niklas, I have a question regarding reproducing SGPT's results. On the MTEB leaderboard, the 125M-weightedmean-msmarco-specb-bitfit model achieves 12.21 NDCG@10 on SCIDOCS. However, I wasn't able to reproduce that result following the instructions here. In my benchmarking I got a very low number (0.00085). I think the instructions are a bit off.

I made a small mistake uploading the script when I was trying to combine this model & this model.
I updated it & here's a Colab that reproduces the 12.21 NDCG@10 exactly.

> My second question is that I couldn't really understand the idea behind this block. Looking at how you tokenize queries and corpus, it seems much more natural to me to simply wrap query text in [ ] and corpus text in { } before tokenizing them. I got an NDCG@10 of 11.09 when preprocessing SCIDOCS this way, which is much closer to the number reported on the leaderboard.

Yes, you can do that, but it will produce slightly worse scores, like this model. This is because the brackets [ ] and { } may get merged with other tokens during tokenization. For example, [This is a sentence] might be tokenized as "[This", " is", " a", " sent", "ence", "]". But we would like the special brackets to always be separate tokens that do not interfere with the text, i.e. "[", "This", .... Thus, the script uses special tokens (SOS) that are added to the vocabulary and will hence always be tokenized separately. Prior to feeding the tokens to the model, these are then replaced with the actual bracket tokens here.
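
For illustration, a minimal sketch of that flow with Hugging Face tokenizers. The placeholder token strings ("[SOS]", "{SOS}") and the helper are assumptions based on the description above, not the exact repo script:

```python
# Sketch of the special-token approach: placeholders are added to the vocab so
# they always tokenize on their own, then swapped for the real bracket tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit"
)

# Added tokens are always split out separately, never merged with the text.
tokenizer.add_tokens(["[SOS]", "{SOS}"], special_tokens=True)
sos_que = tokenizer.convert_tokens_to_ids("[SOS]")
sos_doc = tokenizer.convert_tokens_to_ids("{SOS}")

# Ids of the plain bracket tokens the model actually sees.
bos_que, eos_que = tokenizer.convert_tokens_to_ids(["[", "]"])
bos_doc, eos_doc = tokenizer.convert_tokens_to_ids(["{", "}"])

def encode_specb(text: str, is_query: bool):
    marker = "[SOS]" if is_query else "{SOS}"
    ids = tokenizer(marker + text)["input_ids"]
    # Replace the placeholder with the real opening bracket and append the
    # closing one, so the brackets stay separate tokens around the text.
    ids = [bos_que if i == sos_que else bos_doc if i == sos_doc else i for i in ids]
    ids.append(eos_que if is_query else eos_doc)
    return ids

print(tokenizer.convert_ids_to_tokens(encode_specb("This is a sentence", is_query=True)))
# e.g. ['[', 'This', 'Ġis', 'Ġa', 'Ġsentence', ']'] (exact pieces depend on the vocab)
```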
