documentation on cosine similarity range is wrong #923

Open
JunhaoWang opened this issue Apr 30, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@JunhaoWang

In the docs:

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1.

This is wrong, since cosine similarity can take on negative values. For example:

a = [1.0]
b = [-1.0]
cosine_sim_a_b = dot(a, b) / (norm(a) * norm(b)) = -1.0 / (1.0 * 1.0) = -1
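
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `cosine_similarity` is an illustrative assumption, not a function from this project.

```python
import math

def cosine_similarity(a, b):
    """Return dot(a, b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two vectors pointing in exactly opposite directions.
a = [1.0]
b = [-1.0]
cosine_sim_a_b = cosine_similarity(a, b)
print(cosine_sim_a_b)  # -1.0
```

So the mathematical range of the score is -1 to 1, not 0 to 1.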
JunhaoWang added the bug label Apr 30, 2024
@omkar-334
Contributor

omkar-334 commented May 15, 2024

You're right. Cosine similarity can take on negative values, although for embeddings it is heavily biased towards positive values.
A few interesting discussions of this:

  1. https://datascience.stackexchange.com/questions/101862/cosine-similarity-between-sentence-embeddings-is-always-positive#:~:text=This%20range%20is%20valid%20if,line%20but%20in%20opposite%20directions.
  2. https://stackoverflow.com/questions/60852877/why-is-my-cosine-similarity-always-positive-fasttext

Keep in mind that vector embeddings are learned by modeling the probability of a word in a given context. This means that beautiful and ugly, even though they are opposites, get a medium cosine score, since they are likely to appear in the same contexts. Completely unrelated phrases like quantum mechanics cryptography algorithms and blue kingfisher eating salmon can have a negative cosine score.
However, in Dataset Generation, all of the nodes share the same or a related context, so a negative cosine score is extremely unlikely.
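
To make the point about shared context concrete, here is a small sketch with hand-picked 2-D vectors. These numbers are hypothetical toy values chosen for illustration, not real embeddings from any model, and `cosine_similarity` is an illustrative helper rather than part of this project.

```python
import math

def cosine_similarity(a, b):
    """Return dot(a, b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (NOT real embeddings).
# "beautiful" and "ugly" share a large first component (common context)
# and differ only in the sign of the smaller second component.
beautiful = [1.0, 0.5]
ugly = [1.0, -0.5]

# Two phrases with little in common point in nearly unrelated directions.
physics_phrase = [1.0, 0.2]
kingfisher_phrase = [-0.3, 1.0]

opposites = cosine_similarity(beautiful, ugly)
unrelated = cosine_similarity(physics_phrase, kingfisher_phrase)

print(f"opposites: {opposites:.2f}")  # medium, positive score
print(f"unrelated: {unrelated:.2f}")  # slightly negative score
```

With these toy values the antonym pair scores about 0.6 because the shared component dominates, while the unrelated pair comes out slightly below zero.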
