
HotpotQA Model Evaluator with Dria's Public RAG Model

This project evaluates different models on the HotpotQA dataset. It uses a multi-threaded approach to run model evaluations concurrently, keeping the full evaluation efficient.

Methodology

The evaluation dataset is HotpotQA, which contains 113k Wikipedia-based question-answer pairs.

We evaluate the validation split of the dataset, which contains 7405 question-answer pairs.
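
For reference, this split can be loaded with the Hugging Face datasets library (a sketch; the repository may load the data differently):

from datasets import load_dataset

# HotpotQA "distractor" config; its validation split has 7,405 rows.
hotpot = load_dataset("hotpot_qa", "distractor", split="validation")
print(len(hotpot))            # 7405
print(hotpot[0]["question"])  # first question in the split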

Models Used

  • Mixtral 8x7B
  • Command R
  • Meta Llama 70B
  • Meta Llama 13B
  • GPT-3.5 Turbo

The project includes a Judge class for GPT-as-a-judge. This class is used to evaluate LLM generations with gpt-4-0125-preview.
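
A minimal sketch of such a judge, assuming a YES/NO verdict format (only the class name and judge model come from this README; the prompt and parsing are illustrative):

from openai import OpenAI

class Judge:
    """GPT-as-a-judge: asks gpt-4-0125-preview whether a generation matches the reference answer."""

    def __init__(self):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def evaluate(self, question: str, reference: str, generation: str) -> bool:
        response = self.client.chat.completions.create(
            model="gpt-4-0125-preview",
            messages=[
                {"role": "system",
                 "content": "Decide whether the answer matches the reference answer. Reply YES or NO."},
                {"role": "user",
                 "content": f"Question: {question}\nReference: {reference}\nAnswer: {generation}"},
            ],
        )
        return response.choices[0].message.content.strip().upper().startswith("YES")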

The HotpotQA dataset is loaded and split into rows, with each row being evaluated concurrently by a worker thread.
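
The concurrency pattern is roughly the following (a sketch; evaluate_row stands in for the repository's per-row logic):

from concurrent.futures import ThreadPoolExecutor, as_completed

rows = [...]  # the loaded HotpotQA rows

def evaluate_row(row):
    # Placeholder for the per-row pipeline: retrieve context with Dria,
    # query the model, and score the generation with the judge.
    ...

with ThreadPoolExecutor(max_workers=8) as pool:  # cf. the --max_worker flag
    futures = [pool.submit(evaluate_row, row) for row in rows]
    results = [future.result() for future in as_completed(futures)]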

Evaluation Process

System Prompts for the Evaluated Models

For Replicate and OpenAI:

"Step 1: Analyze context for answering questions.\n"
"Step 2: Decide context is relevant with question or not relevant with question.\n "
"Step 3: If any topic about question mentioned in context, use that information for question.\n "
"Step 4: If context has not mention on question, ignore that context I give you and use your self knowledge.\n "
"Step 5: Answer the question.\n "

For Cohere, there is no system prompt.

Query Prompts for the Evaluated Models

For all models, the query prompt is as follows:

'''
{context}
'''
**Question**: {question}

For Cohere, context is not included in the query prompt. Instead, it is passed as a separate parameter.
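
Assembling the two prompts into an OpenAI-style chat request might look like this (a sketch; the templates are copied from above, while the function name is illustrative):

SYSTEM_PROMPT = (
    "Step 1: Analyze context for answering questions.\n"
    "Step 2: Decide context is relevant with question or not relevant with question.\n "
    "Step 3: If any topic about question mentioned in context, use that information for question.\n "
    "Step 4: If context has not mention on question, ignore that context I give you and use your self knowledge.\n "
    "Step 5: Answer the question.\n "
)

def build_messages(context: str, question: str) -> list:
    # Wrap the retrieved context in triple quotes, then append the question.
    query = f"'''\n{context}\n'''\n**Question**: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]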

The evaluation process is as follows:

  1. Initially, we analyzed vanilla model responses, assessed by the Judge LLM. Next, we extracted context for each question using Dria and calculated similarity scores. Questions below the threshold were excluded from the evaluation set.

  2. For included questions:

    • Context is retrieved from the Local Wikipedia Index using Dria.
    • One article is retrieved from Dria per question.
    • The context is split into smaller chunks.
    • The two chunks with the most shared keywords are selected (see the sketch after this list).
  3. The context is then used to evaluate the RAG model.

  4. The responses generated by the RAG model and the vanilla model are then evaluated by the GPT-4 judge (gpt-4-0125-preview) to determine whether they align with the correct answer.
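
A sketch of the chunk-selection step, assuming "shared keywords" means overlap with the question's content words (the repository's tokenization and stopword list may differ):

import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "and",
             "or", "to", "what", "which", "who", "when", "how"}

def keywords(text: str) -> set:
    # Lowercased content words, minus a small stopword list.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}

def top_two_chunks(question: str, chunks: list) -> list:
    # Rank chunks by keyword overlap with the question; keep the best two.
    q = keywords(question)
    return sorted(chunks, key=lambda c: len(q & keywords(c)), reverse=True)[:2]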

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installing Dria CLI

This project also requires the Dria CLI. You can install it by following the instructions on its GitHub page.

Setup Local Wikipedia with Dria CLI

After installing the Dria CLI, you should fetch the Wikipedia index with Dria and serve it locally. You can do this by running the following commands in your terminal:

dria fetch uaBIB4kh7gYh6vSNL7V2eygfbyRu9vGZ_nJ6jKVn_x8 # Transaction/Contract ID of Wikipedia
dria serve uaBIB4kh7gYh6vSNL7V2eygfbyRu9vGZ_nJ6jKVn_x8

Project Dependencies

To install the necessary dependencies, run the following command in your terminal:

pip install -r requirements.txt

Used APIs and Services

Using the project requires access to the following APIs and services:

  • Cohere API: Required for Cohere model evaluations.
  • OpenAI API: Required for any evaluation.
  • Replicate API: Required for evaluations of Replicate-hosted models (Mixtral, Llama).

You will need to obtain API keys for these services and set them as environment variables in your terminal. The environment variables are as follows:

  • COHERE_API_KEY
  • OPENAI_API_KEY
  • REPLICATE_API_KEY
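
A quick way to verify the keys are visible before starting a run (a convenience sketch, not part of the repository):

import os

# Fail fast if any required API key is missing from the environment.
for key in ("COHERE_API_KEY", "OPENAI_API_KEY", "REPLICATE_API_KEY"):
    if not os.environ.get(key):
        raise SystemExit(f"{key} is not set")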

Usage

To run the main script, use the following command:

python main.py --max_worker <max_worker> --output_dir <output_dir> --dataset_slice <dataset_slice>

Replace <max_worker>, <output_dir>, and <dataset_slice> with your desired values.

  • <max_worker>: The maximum number of worker threads for concurrent model evaluation.
  • <output_dir>: The directory where the evaluation results will be saved.
  • <dataset_slice>: The percentage of the HotpotQA dataset to be used for evaluation.
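
The flags above suggest an argument parser along these lines (a sketch; the defaults here are assumptions, so see main.py for the actual definitions):

import argparse

parser = argparse.ArgumentParser(description="Evaluate models on HotpotQA")
parser.add_argument("--max_worker", type=int, default=4,
                    help="maximum number of worker threads")
parser.add_argument("--output_dir", type=str, default="results",
                    help="directory for evaluation results")
parser.add_argument("--dataset_slice", type=float, default=100.0,
                    help="percentage of HotpotQA to evaluate")
args = parser.parse_args()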

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
