
HotpotQA Model Evaluator with Dria's Public RAG Model

This project evaluates different models on the HotpotQA dataset. It uses a multi-threaded approach to run model evaluations concurrently, keeping the full evaluation efficient.

Methodology

The evaluation dataset is HotpotQA, which contains 113k Wikipedia-based question-answer pairs.

We evaluate the validation split of the dataset, which contains 7405 question-answer pairs.
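
For reference, this split can be loaded with the Hugging Face datasets library (a sketch; the repository may load the data differently):

from datasets import load_dataset

# HotpotQA "distractor" config; its validation split has 7,405 rows.
hotpot = load_dataset("hotpot_qa", "distractor", split="validation")
print(len(hotpot))            # 7405
print(hotpot[0]["question"])  # first question in the split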

Models Used

  • Mixtral 8x7B
  • Command R
  • Meta Llama 70B
  • Meta Llama 13B
  • GPT-3.5 Turbo

The project includes a Judge class for GPT-as-a-judge. This class is used to evaluate LLM generations with gpt-4-0125-preview.
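
A minimal sketch of such a judge, assuming a YES/NO verdict format (only the class name and judge model come from this README; the prompt and parsing are illustrative):

from openai import OpenAI

class Judge:
    """GPT-as-a-judge: asks gpt-4-0125-preview whether a generation matches the reference answer."""

    def __init__(self):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def evaluate(self, question: str, reference: str, generation: str) -> bool:
        response = self.client.chat.completions.create(
            model="gpt-4-0125-preview",
            messages=[
                {"role": "system",
                 "content": "Decide whether the answer matches the reference answer. Reply YES or NO."},
                {"role": "user",
                 "content": f"Question: {question}\nReference: {reference}\nAnswer: {generation}"},
            ],
        )
        return response.choices[0].message.content.strip().upper().startswith("YES")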

The HotpotQA dataset is loaded and split into rows, with each row being evaluated concurrently by a worker thread.
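
The concurrency pattern is roughly the following (a sketch; evaluate_row stands in for the repository's per-row logic):

from concurrent.futures import ThreadPoolExecutor, as_completed

rows = [...]  # the loaded HotpotQA rows

def evaluate_row(row):
    # Placeholder for the per-row pipeline: retrieve context with Dria,
    # query the model, and score the generation with the judge.
    ...

with ThreadPoolExecutor(max_workers=8) as pool:  # cf. the --max_worker flag
    futures = [pool.submit(evaluate_row, row) for row in rows]
    results = [future.result() for future in as_completed(futures)]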

Evaluation Process

System Prompts for the Evaluated Models

For Replicate and OpenAI:

"Step 1: Analyze context for answering questions.\n"
"Step 2: Decide context is relevant with question or not relevant with question.\n "
"Step 3: If any topic about question mentioned in context, use that information for question.\n "
"Step 4: If context has not mention on question, ignore that context I give you and use your self knowledge.\n "
"Step 5: Answer the question.\n "

For Cohere, there is no system prompt.

Query Prompts for the Evaluated Models

For all models, the query prompt is as follows:

'''
{context}
'''
**Question**: {question}

For Cohere, context is not included in the query prompt. Instead, it is passed as a separate parameter.
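
Assembling the two prompts into an OpenAI-style chat request might look like this (a sketch; the templates are copied from above, while the function name is illustrative):

SYSTEM_PROMPT = (
    "Step 1: Analyze context for answering questions.\n"
    "Step 2: Decide context is relevant with question or not relevant with question.\n "
    "Step 3: If any topic about question mentioned in context, use that information for question.\n "
    "Step 4: If context has not mention on question, ignore that context I give you and use your self knowledge.\n "
    "Step 5: Answer the question.\n "
)

def build_messages(context: str, question: str) -> list:
    # Wrap the retrieved context in triple quotes, then append the question.
    query = f"'''\n{context}\n'''\n**Question**: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]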

The evaluation process is as follows:

  1. Initially, we analyzed vanilla model responses, assessed by the Judge LLM. Next, we extracted context for each question using Dria and calculated similarity scores. Questions below the threshold were excluded from the evaluation set.

  2. For included questions:

    • Context is retrieved from the Local Wikipedia Index using Dria.
    • One article is retrieved from Dria per question.
    • The context is split into smaller chunks.
    • The two chunks with the most shared keywords are selected (see the sketch after this list).
  3. The context is then used to evaluate the RAG model.

  4. The responses generated by the RAG model and the vanilla model are then evaluated by the GPT-4 judge (gpt-4-0125-preview) to determine whether they align with the correct answer.
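
A sketch of the chunk-selection step, assuming "shared keywords" means overlap with the question's content words (the repository's tokenization and stopword list may differ):

import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "and",
             "or", "to", "what", "which", "who", "when", "how"}

def keywords(text: str) -> set:
    # Lowercased content words, minus a small stopword list.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}

def top_two_chunks(question: str, chunks: list) -> list:
    # Rank chunks by keyword overlap with the question; keep the best two.
    q = keywords(question)
    return sorted(chunks, key=lambda c: len(q & keywords(c)), reverse=True)[:2]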

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installing Dria CLI

This project also requires the Dria CLI. You can install it by following the instructions on its GitHub page.

Setup Local Wikipedia with Dria CLI

After installing the Dria CLI, you should fetch the Wikipedia index with Dria and serve it locally. You can do this by running the following commands in your terminal:

dria fetch uaBIB4kh7gYh6vSNL7V2eygfbyRu9vGZ_nJ6jKVn_x8 # Transaction/Contract ID of Wikipedia
dria serve uaBIB4kh7gYh6vSNL7V2eygfbyRu9vGZ_nJ6jKVn_x8

Project Dependencies

To install the necessary dependencies, run the following command in your terminal:

pip install -r requirements.txt

Used APIs and Services

Using the project requires access to the following APIs and services:

  • Cohere API: Required for Cohere model evaluations.
  • OpenAI API: Required for any evaluation.
  • Replicate API: Required for evaluations of Replicate-hosted models (Mixtral, Llama).

You will need to obtain API keys for these services and set them as environment variables in your terminal. The environment variables are as follows:

  • COHERE_API_KEY
  • OPENAI_API_KEY
  • REPLICATE_API_KEY
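
A quick way to verify the keys are visible before starting a run (a convenience sketch, not part of the repository):

import os

# Fail fast if any required API key is missing from the environment.
for key in ("COHERE_API_KEY", "OPENAI_API_KEY", "REPLICATE_API_KEY"):
    if not os.environ.get(key):
        raise SystemExit(f"{key} is not set")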

Usage

To run the main script, use the following command:

python main.py --max_worker <max_worker> --output_dir <output_dir> --dataset_slice <dataset_slice>

Replace <max_worker>, <output_dir>, and <dataset_slice> with your desired values.

  • <max_worker>: The maximum number of worker threads for concurrent model evaluation.
  • <output_dir>: The directory where the evaluation results will be saved.
  • <dataset_slice>: The percentage of the HotpotQA dataset to be used for evaluation.
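
The flags above suggest an argument parser along these lines (a sketch; the defaults here are assumptions, so see main.py for the actual definitions):

import argparse

parser = argparse.ArgumentParser(description="Evaluate models on HotpotQA")
parser.add_argument("--max_worker", type=int, default=4,
                    help="maximum number of worker threads")
parser.add_argument("--output_dir", type=str, default="results",
                    help="directory for evaluation results")
parser.add_argument("--dataset_slice", type=float, default=100.0,
                    help="percentage of HotpotQA to evaluate")
args = parser.parse_args()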

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
