
I want to know how the Testset Generator creates a dataset. Can anyone give a flowchart or anything? I want a high-level understanding #920

Open
sfc-gh-akashyap opened this issue Apr 30, 2024 · 6 comments
Labels
question Further information is requested

Comments

@sfc-gh-akashyap

sfc-gh-akashyap commented Apr 30, 2024

I checked the documentation and related resources and couldn't find an answer to my question.

I want a high-level understanding. I know what evolution is, but I want details like how many LLM calls are happening and how the questions are actually generated.

@sfc-gh-akashyap sfc-gh-akashyap added the question Further information is requested label Apr 30, 2024
@omkar-334
Contributor

@sfc-gh-akashyap Check out this website - https://docs.ragas.io/en/stable/concepts/testset_generation.html
It has a pipeline flowchart too.

@sfc-gh-akashyap
Author

Yeah, I know about it, but I want to know more about the LLM calls and prompts, what is actually happening. For example, how are the embeddings used? I think it first randomly selects a context from the given chunks, then creates a question from that chunk. Then, in another LLM call, it creates the answer from the question and context. I just want a better understanding of this. Can you please help?

@ciekawy

ciekawy commented May 9, 2024

@sfc-gh-akashyap Yesterday I made this with the help of Claude Opus & Mermaid (not 100% sure how precise it is in the details, but it presents the concept reasonably well):

%%{init: {'theme':'neutral'}}%%
flowchart LR
    subgraph Init [" "]
        A[Prepare RAG Data] --> B[Generate Questions]
    end

    C{Question Types}

    C -->|Simple| D[Generate Seed Questions]
    C -->|Complex| E[Generate Reasoning Questions]
    C -->|Conditional| F[Generate Scenario-based Questions]
    C -->|Multi-Context| G[Generate Topic-based Questions]

    D & E & F & G --> .

    . --> |Generate Answers| Answers_Block

    subgraph Answers_Block
        direction TB
        H[Filter and Refine Questions] --> I[Generate Answers]
        I --> GT[Generate Ground Truth]
        GT --> J[Identify Relevant Data Chunks]
        J --> K[Combine Information for Comprehensive Answers]
        K --> L[Create Data Rows]
        L --> M[Combine Questions, Answers, Ground Truth, and Context]
        M --> N[Structure Data Rows]
        N --> JSON[Fix JSON Format]
        JSON --> O[Assess Quality]
        O -->|Low Quality| H
        O -->|High Quality| Test_Set
    end

    style A fill:#1F4E79,color:#FFFFFF
    style B fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style C fill:#FF7F50,color:#FFFFFF
    style D fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style E fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style F fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style G fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style Answers_Block fill:#F0F8FF,stroke:#6495ED,color:#333333
    style H fill:#6495ED,color:#FFFFFF
    style I fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style GT fill:#FFD700,color:#000000,stroke-width:3px
    style K fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style L fill:#6495ED,color:#FFFFFF
    style O fill:#FF6347,color:#FFFFFF
    style JSON fill:#2E8B57,color:#FFFFFF,stroke-width:3px

@sfc-gh-akashyap
Author

According to this, the LLM first generates the question, then the answer, and then identifies the chunks from the given data as the ground truth. The problem I see with this approach is that the LLM creates its own answer, which it shouldn't; it should consider the context first, so that it can answer better. I think something else is going on. Please let me know what you think @ciekawy

@omkar-334
Contributor

@sfc-gh-akashyap
The testset generator basically has this pipeline:

  1. Select a random node/embedding.
  2. Generate a seed question from that node.
  3. Get the top_k embeddings most similar to the selected node.
  4. Format the prompt based on the question_type.
  5. Answer the question using the obtained top_k context.
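For anyone who finds pseudocode easier than prose: the five steps above could be sketched roughly like this. This is a hypothetical illustration only; `embed`, `llm`, the prompt strings, and `generate_row` are stand-ins I made up, not the actual ragas internals.

```python
# Hypothetical sketch of the pipeline described above (not the real ragas API).
import random


def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def generate_row(chunks, embed, llm, question_type, top_k=3):
    # 1. Select a random node (chunk) and compute its embedding.
    seed = random.choice(chunks)
    seed_vec = embed(seed)

    # 2. Generate a seed question from that chunk (one LLM call).
    question = llm(f"Write a question answerable from:\n{seed}")

    # 3. Retrieve the top_k chunks most similar to the selected node.
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), seed_vec), reverse=True)
    context = ranked[:top_k]

    # 4. Format the prompt based on the question_type (e.g. evolve the seed
    #    question into a reasoning / conditional / multi-context variant).
    if question_type != "simple":
        question = llm(f"Rewrite as a {question_type} question: {question}")

    # 5. Answer the question using only the retrieved top_k context
    #    (a further LLM call), so the answer is grounded in the chunks.
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return {"question": question, "contexts": context, "ground_truth": answer}
```

Note that in this flow the answer is produced *after* retrieval, from the retrieved context, which addresses the concern raised above about the LLM inventing its own answer.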

@sfc-gh-akashyap
Author

Also, one more doubt @omkar-334: when embedding a node, does it make sure not to lose any tokens? Embedding models have different context lengths.
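For illustration, token-safe chunking usually looks something like the sketch below: each chunk is kept within the embedding model's token budget, with overlap so nothing is cut mid-thought. The whitespace tokenizer and the `max_tokens`/`overlap` values are simplifying assumptions; real pipelines use the model's own tokenizer (e.g. tiktoken).

```python
# Illustrative sketch: split a document so every chunk fits the embedding
# model's context window and no token is dropped. Whitespace tokenization
# is an assumption for brevity, not what real pipelines use.


def chunk_by_tokens(text, max_tokens=512, overlap=64):
    tokens = text.split()  # stand-in for the model's real tokenizer
    chunks = []
    step = max_tokens - overlap  # slide forward, keeping `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks
```

If a node is longer than the embedding model's window and is embedded as-is, the tail tokens are silently truncated, which is exactly why chunking is done up front.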
