I want to know how the Testset Generator creates a dataset. Can anyone provide a flowchart or something similar? I want a high-level understanding. #920
Comments
@sfc-gh-akashyap Check out this page: https://docs.ragas.io/en/stable/concepts/testset_generation.html
Yeah, I know about it, but I want to understand more about what happens in the LLM calls and prompts, for example, how the embeddings are used. My guess is that it first randomly selects a context from the given chunks, then creates a question from that chunk, and then in another LLM call creates the answer from the question and context. So I just want a better understanding of this. Can you please help?
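The two-call flow guessed at above can be sketched in a few lines. This is a hypothetical illustration of that guess, not the actual ragas implementation; `llm` is any caller-supplied callable that sends a prompt to an LLM and returns the completion text.

```python
# Hypothetical sketch of the two-call flow described above -- NOT ragas source.
import random

def generate_qa_pair(chunks: list[str], llm) -> dict:
    # Call 1: pick a chunk at random and ask the LLM to write a question
    # that can be answered from it.
    context = random.choice(chunks)
    question = llm(f"Write one question answerable from this text:\n{context}")

    # Call 2: answer the question, conditioning on the same chunk so the
    # answer stays grounded in the source context.
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    return {"question": question, "context": context, "answer": answer}
```

Under this sketch each dataset row costs two LLM calls; frameworks typically add more calls for filtering and "evolving" questions on top of this.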
@sfc-gh-akashyap Yesterday I made this with the help of Claude Opus and Mermaid (not 100% sure how precise the details are, but it presents the concept reasonably well):

```mermaid
%%{init: {'theme':'neutral'}}%%
flowchart LR
    subgraph Init [" "]
        A[Prepare RAG Data] --> B[Generate Questions]
    end
    B --> C{Question Types}
    C -->|Simple| D[Generate Seed Questions]
    C -->|Complex| E[Generate Reasoning Questions]
    C -->|Conditional| F[Generate Scenario-based Questions]
    C -->|Multi-Context| G[Generate Topic-based Questions]
    D & E & F & G --> Merge((merge))
    Merge --> |Generate Answers| Answers_Block
    subgraph Answers_Block
        direction TB
        H[Filter and Refine Questions] --> I[Generate Answers]
        I --> GT[Generate Ground Truth]
        GT --> J[Identify Relevant Data Chunks]
        J --> K[Combine Information for Comprehensive Answers]
        K --> L[Create Data Rows]
        L --> M[Combine Questions, Answers, Ground Truth, and Context]
        M --> N[Structure Data Rows]
        N --> JSON[Fix JSON Format]
        JSON --> O[Assess Quality]
        O -->|Low Quality| H
        O -->|High Quality| Test_Set
    end
    style A fill:#1F4E79,color:#FFFFFF
    style B fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style C fill:#FF7F50,color:#FFFFFF
    style D fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style E fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style F fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style G fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style Answers_Block fill:#F0F8FF,stroke:#6495ED,color:#333333
    style H fill:#6495ED,color:#FFFFFF
    style I fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style GT fill:#FFD700,color:#000000,stroke-width:3px
    style K fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style L fill:#6495ED,color:#FFFFFF
    style O fill:#FF6347,color:#FFFFFF
    style JSON fill:#2E8B57,color:#FFFFFF,stroke-width:3px
```
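The flowchart's generate/answer/assess loop can be rendered as a short conceptual Python sketch. This carries the same caveat as the diagram itself: it is not ragas source code, and all the helper callables (`generate_question`, `generate_answer`, `score_quality`) are hypothetical stand-ins for the underlying LLM calls.

```python
def build_testset(chunks, generate_question, generate_answer, score_quality,
                  n_samples, threshold=0.7):
    """Conceptual loop matching the flowchart: generate a question, answer
    it from its context, assess quality, and retry low-quality rows.
    All helpers are caller-supplied stand-ins for LLM calls (a sketch,
    not ragas internals)."""
    rows = []
    while len(rows) < n_samples:
        question, context = generate_question(chunks)    # seed/evolved question
        answer = generate_answer(question, context)      # answer grounded in context
        if score_quality(question, answer) >= threshold:  # critic/quality step
            rows.append({"question": question,
                         "contexts": [context],
                         "ground_truth": answer})
        # low-quality rows are simply discarded and regenerated
    return rows
```

The key design point this makes visible is the feedback edge in the diagram: rows failing the quality check loop back to generation rather than entering the test set.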
According to this, the LLM first generates the question, then the answer, and then identifies the relevant chunks from the given data as ground truth. The problem I see with this approach is that the LLM creates its own answer, which should not happen; it should consider the context first, so it can answer more accurately. I think something else is going on. Please let me know what you think @ciekawy
@sfc-gh-akashyap
Also, one more doubt @omkar-334: when embedding a node, does it make sure not to lose any tokens? Embedding models have different context lengths.
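The concern above is that a node longer than the embedding model's context window would be silently truncated. A common workaround, sketched below, is to split the text into overlapping windows that each fit the limit and embed every window. The token counting here is a crude whitespace proxy; a real pipeline would use the embedding model's own tokenizer, and the function names are illustrative, not from ragas.

```python
# Sketch: split text into overlapping windows so no tokens are dropped when
# the text exceeds the embedding model's context length. Whitespace splitting
# stands in for a real tokenizer here.
def window_tokens(text: str, max_tokens: int, overlap: int = 32) -> list[str]:
    tokens = text.split()              # stand-in for model tokenization
    if len(tokens) <= max_tokens:
        return [text]                  # fits: embed as-is
    windows, start = [], 0
    step = max_tokens - overlap        # advance with overlap for continuity
    while start < len(tokens):
        windows.append(" ".join(tokens[start:start + max_tokens]))
        start += step
    return windows
```

Each window would then be embedded separately (and, for example, the vectors averaged), so the full text contributes to the embedding instead of being cut off at the context limit.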
I checked the documentation and related resources and couldn't find an answer to my question.
I want a high-level understanding.
I now know what an evolution is, but I want to know details like how many LLM calls happen and how the questions are actually generated.