documentation on cosine similarity range is wrong #923

JunhaoWang · 2024-04-30T19:23:48Z

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1.

This is wrong, since cosine similarity can take on negative values. For example:

a = [1.0]
b = [-1.0]
cosine_sim(a, b) = (a · b) / (‖a‖ ‖b‖) = (-1.0) / (1.0 × 1.0) = -1
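
The same calculation as a minimal Python sketch. The helper name `cosine_similarity` is only illustrative and is not meant to represent the library's own implementation; it assumes two non-zero vectors of equal length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = [1.0]
b = [-1.0]
print(cosine_similarity(a, b))  # -1.0, outside the documented 0 to 1 range
```

Any pair of vectors pointing in opposite directions gives the same result, so the lower bound of the metric is -1 rather than 0.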

omkar-334 · 2024-05-15T12:54:13Z

You're right. Cosine similarity can take on negative values, although for text embeddings it is heavily biased towards positive values.
A few interesting experiments on this:

Keep in mind that vector embeddings are a result of computing the probability of a word in a given context. This means that "beautiful" and "ugly", even though they are opposites, have a medium cosine score, since they're likely to appear in the same context. Completely unrelated phrases like "quantum mechanics cryptography algorithms" and "blue kingfisher eating salmon" have a negative cosine score.
However, in Dataset Generation, since all of the nodes share the same or a related context, a negative cosine score is extremely unlikely.

JunhaoWang added the bug (Something isn't working) label on Apr 30, 2024
