Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Usage for semantic search #47

Open
rut00 opened this issue Feb 21, 2024 · 6 comments
Open

Usage for semantic search #47

rut00 opened this issue Feb 21, 2024 · 6 comments

Comments

@rut00
Copy link

rut00 commented Feb 21, 2024

Hello,
I want to create a semantic search functionality. The model is accurate in calculating the similarity between the word synonyms. Here are a few excerpts of the demo:
Supposedly, my dataset has the following lines:

User: Who is the author of "Romeo and Juliet"?
Model: "Romeo and Juliet" was written by William Shakespeare.

User: Describe the water cycle.
Model: The water cycle is the continuous movement of water on, above, and below the surface of the Earth, involving processes like evaporation, condensation, precipitation, and runoff.

User: Describe the process of DNA replication.
Model: DNA replication is the process of copying DNA to ensure genetic continuity during cell division.

And my prompts:

Prompt 1: rameo and juliet
Output: "Romeo and Juliet" was written by William Shakespeare.

Prompt 2: Tell me about water cycle
Output: The water cycle is the continuous movement of water on, above, and below the surface of the Earth, involving processes like evaporation, condensation, precipitation, and runoff.

Prompt 3: Deoxyribonucleic acid
Output: DNA replication is the process of copying DNA to ensure genetic continuity during cell division.

Prompt 4: what is python language
Output: Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll pigments.

The confidence value for each of the prompts ranges between 0.25 and 0.4. The issue I am facing is, that the model gives the same confidence value for wrong and right prompt outputs as seen in prompt 4 output. I want to show "No results found" if the given word is not in the dataset.

How do I solve this issue and make it more efficient? Thank you in advance.

@Muennighoff
Copy link
Owner

You're using the Cross-Encoder, correct?

@rut00
Copy link
Author

rut00 commented Feb 21, 2024

No, I am using Asymmetric Semantic Search Bi-encoder.

@Muennighoff
Copy link
Owner

I see, so you're saying that the cosine similarity for what is python language and Photosynthesis is the process by which green plants and s... is as high as the other ones?

@rut00
Copy link
Author

rut00 commented Feb 21, 2024

Yes. The confidence levels are so similar that I cannot put a threshold level for differentiating them.

@Muennighoff
Copy link
Owner

Hm what model are you using? I'd recommend switching to a bigger / better one, specifically I'd recommend this one: https://huggingface.co/GritLM/GritLM-7B

@rut00
Copy link
Author

rut00 commented Feb 21, 2024

I am using this model: SGPT-125M-weightedmean-msmarco-specb-bitfit and I will try the recommended model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants