
I want to know how the Testset Generator creates a dataset. Can anyone give a flowchart or anything? I want a high-level understanding #920

Open
sfc-gh-akashyap opened this issue Apr 30, 2024 · 6 comments
Labels
question Further information is requested

Comments

@sfc-gh-akashyap

sfc-gh-akashyap commented Apr 30, 2024

I checked the documentation and related resources and couldn't find an answer to my question.

I want a high-level understanding. I know what evolution is, but I want details like how many LLM calls are happening and how the questions are actually generated.

@sfc-gh-akashyap sfc-gh-akashyap added the question Further information is requested label Apr 30, 2024
@omkar-334
Contributor

@sfc-gh-akashyap Check out this website - https://docs.ragas.io/en/stable/concepts/testset_generation.html
It has a pipeline flowchart too.

@sfc-gh-akashyap
Author

Yeah, I know about it, but I want to know more about the LLM calls and prompts, what is actually happening. For example, how are the embeddings used? I think it first randomly selects a context from the given chunks, then creates a question from that chunk. Then, in another LLM call, it creates the answer from the question and context. I just want a better understanding of this. Can you please help?

@ciekawy

ciekawy commented May 9, 2024

@sfc-gh-akashyap Yesterday I made this with the help of Claude Opus & Mermaid (not 100% sure how precise it is in the details, but it presents the concept reasonably well):

%%{init: {'theme':'neutral'}}%%
flowchart LR
    subgraph Init [" "]
        A[Prepare RAG Data] --> B[Generate Questions]
    end

    C{Question Types}

    C -->|Simple| D[Generate Seed Questions]
    C -->|Complex| E[Generate Reasoning Questions]
    C -->|Conditional| F[Generate Scenario-based Questions]
    C -->|Multi-Context| G[Generate Topic-based Questions]

    D & E & F & G --> .

    . --> |Generate Answers| Answers_Block

    subgraph Answers_Block
        direction TB
        H[Filter and Refine Questions] --> I[Generate Answers]
        I --> GT[Generate Ground Truth]
        GT --> J[Identify Relevant Data Chunks]
        J --> K[Combine Information for Comprehensive Answers]
        K --> L[Create Data Rows]
        L --> M[Combine Questions, Answers, Ground Truth, and Context]
        M --> N[Structure Data Rows]
        N --> JSON[Fix JSON Format]
        JSON --> O[Assess Quality]
        O -->|Low Quality| H
        O -->|High Quality| Test_Set
    end

    style A fill:#1F4E79,color:#FFFFFF
    style B fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style C fill:#FF7F50,color:#FFFFFF
    style D fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style E fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style F fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style G fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style Answers_Block fill:#F0F8FF,stroke:#6495ED,color:#333333
    style H fill:#6495ED,color:#FFFFFF
    style I fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style GT fill:#FFD700,color:#000000,stroke-width:3px
    style K fill:#2E8B57,color:#FFFFFF,stroke-width:3px
    style L fill:#6495ED,color:#FFFFFF
    style O fill:#FF6347,color:#FFFFFF
    style JSON fill:#2E8B57,color:#FFFFFF,stroke-width:3px

@sfc-gh-akashyap
Author

According to this, the LLM first generates the question, then the answer, and then identifies the chunks from the given data as the ground truth. The problem I see with this approach is that the LLM creates its own answer, which it shouldn't; it should consider the context first, so that it can answer better. I think something else is going on. Please let me know what you think @ciekawy

@omkar-334
Contributor

@sfc-gh-akashyap
The testset generator basically has this pipeline:

  1. Select a random node/embedding.
  2. Generate a seed question from that node.
  3. Get the top_k embeddings most similar to the selected node.
  4. Format the prompt based on the question_type.
  5. Answer the question using the obtained top_k context.
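For anyone who finds pseudocode easier than prose: the five steps above could be sketched roughly like this. This is a hypothetical illustration only; `embed`, `llm`, the prompt strings, and `generate_row` are stand-ins I made up, not the actual ragas internals.

```python
# Hypothetical sketch of the pipeline described above (not the real ragas API).
import random


def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def generate_row(chunks, embed, llm, question_type, top_k=3):
    # 1. Select a random node (chunk) and compute its embedding.
    seed = random.choice(chunks)
    seed_vec = embed(seed)

    # 2. Generate a seed question from that chunk (one LLM call).
    question = llm(f"Write a question answerable from:\n{seed}")

    # 3. Retrieve the top_k chunks most similar to the selected node.
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), seed_vec), reverse=True)
    context = ranked[:top_k]

    # 4. Format the prompt based on the question_type (e.g. evolve the seed
    #    question into a reasoning / conditional / multi-context variant).
    if question_type != "simple":
        question = llm(f"Rewrite as a {question_type} question: {question}")

    # 5. Answer the question using only the retrieved top_k context
    #    (a further LLM call), so the answer is grounded in the chunks.
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return {"question": question, "contexts": context, "ground_truth": answer}
```

Note that in this flow the answer is produced *after* retrieval, from the retrieved context, which addresses the concern raised above about the LLM inventing its own answer.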

@sfc-gh-akashyap
Author

Also, one more doubt @omkar-334: when embedding a node, does it make sure not to lose any tokens? Embedding models have different context lengths.
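For illustration, token-safe chunking usually looks something like the sketch below: each chunk is kept within the embedding model's token budget, with overlap so nothing is cut mid-thought. The whitespace tokenizer and the `max_tokens`/`overlap` values are simplifying assumptions; real pipelines use the model's own tokenizer (e.g. tiktoken).

```python
# Illustrative sketch: split a document so every chunk fits the embedding
# model's context window and no token is dropped. Whitespace tokenization
# is an assumption for brevity, not what real pipelines use.


def chunk_by_tokens(text, max_tokens=512, overlap=64):
    tokens = text.split()  # stand-in for the model's real tokenizer
    chunks = []
    step = max_tokens - overlap  # slide forward, keeping `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks
```

If a node is longer than the embedding model's window and is embedded as-is, the tail tokens are silently truncated, which is exactly why chunking is done up front.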
