Synthetic Data Generation using LangChain for IR and RAG Evaluation

This repository demonstrates LangChain, Llama2-Chat, and zero- and few-shot prompt engineering to enable synthetic data generation for Information Retrieval (IR) and Retrieval Augmented Generation (RAG) evaluation.

Introduction  •  Highlights  •  Example Notebooks  •  Background  •  Metrics  •  Benefits  •  Prompt Templates  •  Issues  •  TODOs

Introduction

Large language models (LLMs) have transformed Information Retrieval (IR) and search by comprehending complex queries. This repository showcases concepts and packages that can be used to generate sophisticated synthetic datasets for IR and Retrieval Augmented Generation (RAG) evaluation.

The synthetic data generated consists of a query and an answer for a given context. An example of a synthetically generated context-query-answer triple is shown below:

Provided Context (usually split from documents / text sources): 
Pure TalkUSA is an American mobile virtual network operator headquartered in Covington, Georgia, United States. 
It is most notable for an industry-first offering of rollover data in their data add-on packages, which has since been discontinued. 
Pure TalkUSA is a subsidiary of Telrite Corporation. Bring Your Own Phone! 

Synthetically Generated Query: 
What was the outstanding service offered by Pure TalkUSA?

Synthetically Generated Answer:
The outstanding service from Pure TalkUSA was its industry-first offering of rollover data.

When building an IR or RAG system, a dataset of contexts, queries, and answers is vital for evaluating the system's performance. Human-annotated datasets offer excellent ground truths but can be expensive and challenging to obtain; therefore, synthetic datasets generated with LLMs are an attractive solution and supplement.

By employing LLM prompt engineering, a diverse range of synthetic queries and answers can be generated to form a robust validation dataset. This repository showcases a process to generate synthetic data while emphasizing zero- and few-shot prompting for creating highly customizable synthetic datasets. Figure 1 outlines the synthetic dataset generation process demonstrated in this repository.


Figure 1: Synthetic Data Generation for IR and RAG Evaluation

NOTE: Refer to the Background and Metrics sections for a deeper dive on IR, RAG, and how to evaluate these systems.

Highlights

A few of the key highlights of this repository are:

  • Local LLMs on consumer grade hardware are used exclusively throughout; no external API calls are performed, which is paramount for data privacy. Many online examples instead call State-of-the-Art (SOTA) LLMs through external APIs, which generally produce higher quality results than smaller local LLMs. Working with local models therefore introduces coding and error-handling challenges of its own, and solutions are shown here.
  • Zero- and Few-Shot Prompting for highly customizable query and answer generation are presented.
  • LangChain examples using:
    • Custom prompt engineering,
    • Output parsers and auto-fixing parsers to obtain structured data,
    • Batch GPU inference with chains,
    • LangChain Expression Language (LCEL).
  • Quantization for reducing model size to fit consumer grade hardware (a minimal loading sketch follows this list).
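For illustration, a minimal sketch of loading a local Llama2-Chat model with 4-bit quantization is shown below. The model ID and bitsandbytes settings are assumptions for the sketch and may differ from the notebooks in this repository.

```python
# Minimal sketch: load a local Llama2-Chat checkpoint in 4-bit so it fits on a
# consumer grade GPU. Model ID and quantization settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed locally cached checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # spread layers across available GPUs
)
```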

Example Notebooks

Context-Query-Answer Generation with LangChain

1.) LangChain with Custom Prompts and Output Parsers for Structured Data Output: see gen-question-answer-query.ipynb for an example of synthetic context-query-answer data generation using custom prompts and output parsers to return structured data.
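The rough pattern used for structured output looks like the sketch below; the QueryAnswer fields and prompt wording are illustrative assumptions rather than the notebook's exact code, and llm stands for a local Llama2-Chat LangChain wrapper.

```python
# Sketch: structured context-query-answer generation with an output parser.
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser, OutputFixingParser

class QueryAnswer(BaseModel):
    query: str = Field(description="question answerable from the context")
    answer: str = Field(description="answer grounded in the context")

parser = PydanticOutputParser(pydantic_object=QueryAnswer)

prompt = PromptTemplate(
    template=(
        "Given the context below, write one question and its answer.\n"
        "{format_instructions}\n\nContext:\n{context}\n"
    ),
    input_variables=["context"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Local models sometimes emit malformed JSON; an auto-fixing parser can retry:
# fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)
# chain = prompt | llm | fixing_parser  # LCEL; llm is a local Llama2-Chat wrapper
```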

Context-Query Generation with LangChain

1.) LangChain Custom Llama2-Chat Prompting: see qa-gen-query-langchain.ipynb for an example of how to build LangChain custom prompt templates for context-query generation with Llama2-Chat.
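As a rough illustration (the instruction text and the llm pipeline object are assumptions, not the notebook's exact code), a Llama2-Chat prompt can be wrapped in a LangChain PromptTemplate and composed into a chain with LCEL for batch generation:

```python
# Sketch: wrap the Llama2-Chat template in a LangChain prompt and run it as a chain.
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline  # wraps a local transformers pipeline

llama2_template = (
    "<s>[INST] <<SYS>>\n"
    "You write one short search query answerable from the given passage.\n"
    "<</SYS>>\n\n"
    "Passage:\n{context} [/INST]"
)
prompt = PromptTemplate(template=llama2_template, input_variables=["context"])

# llm = HuggingFacePipeline(pipeline=generation_pipeline)    # local Llama2-Chat
# chain = prompt | llm                                       # LCEL composition
# queries = chain.batch([{"context": c} for c in contexts])  # batch GPU inference
```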

Context-Query Generation without LangChain

1.) Zero- and Few-Shot Prompt Engineering: See qa-gen-query.ipynb for an example of synthetic context-query data generation for custom datasets. Key features presented here are:

  • Prompting LLMs using zero- and few-shot annotations on the SquadV2 question-answering dataset.
  • Demonstrates two prompting techniques (prompt sketches follow this list):
    • Basic zero-shot query generation, referred to as vanilla
    • Few-shot generation guided by bad questions (GBQ)
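A minimal sketch of the two prompt styles is shown below; the wording and the bad-question example are illustrative assumptions, not the exact prompts from the notebook.

```python
# Vanilla zero-shot: ask directly for a query given the passage.
vanilla_prompt = (
    "Write one question that can be answered using the passage below.\n\n"
    "Passage: {context}\nQuestion:"
)

# Few-shot guided by bad questions (GBQ): show a bad (vague) question next to a
# good one so the model learns which pattern to avoid.
gbq_prompt = (
    "Write one good question for the passage. A good question is specific and "
    "answerable from the passage alone.\n\n"
    "Example passage: Pure TalkUSA is a subsidiary of Telrite Corporation.\n"
    "Bad question: What is this passage about?\n"
    "Good question: Which company is Pure TalkUSA a subsidiary of?\n\n"
    "Passage: {context}\nGood question:"
)
```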

2.) Context-Argument: See argument-gen-query.ipynb for examples of synthetic context-query data generation for argument retrieval tasks. In information retrieval, argument retrieval aims to surface relevant arguments from sources such as documents, with the goal of giving users persuasive and credible information to support their positions or make informed decisions.
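Illustratively (not the notebook's exact wording), the same pattern only needs a task-specific instruction to target argument retrieval:

```python
# Sketch: tailor the instruction so generated queries suit argument retrieval.
argument_prompt = (
    "Read the passage below and write one debate-style query for which the "
    "passage provides a persuasive supporting or opposing argument.\n\n"
    "Passage: {context}\nQuery:"
)
```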

Non-Llama Query Generation

Other examples of query specific generation models (e.g., BeIR/query-gen-msmarco-t5-base-v1) can readily be found online (see BEIR Question Generation).
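For reference, a minimal sketch of that approach is shown below; the passage text and generation parameters are illustrative.

```python
# Sketch: generate synthetic queries for a passage with a dedicated T5 query-gen model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

passage = "Pure TalkUSA is an American mobile virtual network operator."
inputs = tokenizer.encode(passage, return_tensors="pt")
outputs = model.generate(
    inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3
)
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```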

Background

The primary function of an IR system is retrieval, which aims to determine the relevance between a user's query and the content to be retrieved. Implementing an IR or RAG system demands user-specific documents; however, custom document collections typically lack annotated datasets, which hampers system evaluation. Figure 2 provides an overview of a typical RAG process for a question-answering system.


Figure 2: RAG process overview [Source].

These synthetic context-query-answer datasets are crucial for evaluating: 1) the IR system's ability to select the enhanced context, as illustrated in Figure 2 - Step #3, and 2) the RAG system's generated response, as shown in Figure 2 - Step #5. By enabling offline evaluation, they allow a thorough analysis of the system's balance between speed and accuracy, informing necessary revisions and the selection of champion system designs.

The design of IR and RAG systems is becoming more complicated, as illustrated in Figure 3.

Figure 3: LLMs can be used as the query rewriter, retriever, reranker, and reader [Source].

As shown, there are several considerations in IR / RAG design, and solutions range in complexity from traditional methods (e.g., term-based sparse retrieval) to neural methods (e.g., embeddings and LLMs). Evaluation of these systems is critical to making well-informed design decisions. From search to recommendations, evaluation measures are paramount to understanding what does and does not work in retrieval.

Metrics

Question-Answering (QA) systems (e.g., a RAG system) have two components:

  1. Retriever - which retrieves the most relevant information needed to answer the query
  2. Generator - which generates the answer with the retrieved information.

When evaluating a QA system, both components need to be evaluated separately and together to get an overall system score.

Whenever a question is asked to a RAG application, the following objects can be considered [Source]:

  • The question
  • The correct answer to the question
  • The answer that the RAG application returned
  • The context that the RAG application retrieved and used to answer the question

The selection of metrics is not a primary focus of this repository since metrics are application-dependent; however, reference articles and information are provided for convenience.

Retriever Metrics

Figure 4 shows common evaluation metrics for IR; the dataset from Figure 1 can be used to compute the offline metrics shown in Figure 4.

Figure 4: Ranking evaluation metrics [Source]

Offline metrics are measured in an isolated environment before deploying a new IR system. These look at whether a particular set of relevant results are returned when retrieving items with the system [Source].
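As a minimal, self-contained sketch (the metric choice and data layout are assumptions), recall@k and mean reciprocal rank (MRR) can be computed from the synthetic query-context pairs once the retriever returns a ranked list of context IDs per query:

```python
# Sketch: offline retriever metrics over synthetic (query -> gold context) pairs.
# `ranked` maps each query ID to the retriever's ranked list of context IDs.
def recall_at_k(ranked, gold, k=5):
    hits = sum(1 for q, ids in ranked.items() if gold[q] in ids[:k])
    return hits / len(ranked)

def mean_reciprocal_rank(ranked, gold):
    total = 0.0
    for q, ids in ranked.items():
        if gold[q] in ids:
            total += 1.0 / (ids.index(gold[q]) + 1)
    return total / len(ranked)

# Toy example: q1's gold context is ranked first, q2's is ranked second.
gold = {"q1": "ctx_a", "q2": "ctx_b"}
ranked = {"q1": ["ctx_a", "ctx_c"], "q2": ["ctx_c", "ctx_b", "ctx_a"]}
print(recall_at_k(ranked, gold, k=1), mean_reciprocal_rank(ranked, gold))  # 0.5 0.75
```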

Generator Metrics

A brief review of generator metrics showcases a few tiers of metric complexity. When evaluating the generator, look at whether, or to what extent, the generated answers match the correct answer or answers.

Provided below are generator metrics listed in order of least to most complex.

  • Traditional: metrics such as F1, Accuracy, Exact Match, ROUGE, BLEU, etc. can be computed, but they correlate weakly with human judgement; they do, however, offer simple and quick quantitative comparisons.
  • Semantic Answer Similarity: trained encoder models, such as those behind the SAS metric, BERT-based scorers, and other models available on Sentence-Transformers, return similarity scores between generated and reference answers (a short sketch follows this list).
  • Using LLMs to evaluate themselves: this is the inner workings of popular RAG evaluation packages like Ragas and TonicAI/tvalmetrics.
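A short sketch of the semantic-similarity tier is shown below; the model choice and example answers are assumptions, and any Sentence-Transformers model could be substituted.

```python
# Sketch: semantic similarity between a generated answer and a reference answer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
generated = "Pure TalkUSA's standout service was its rollover data offering."
reference = "The outstanding service from Pure TalkUSA was its industry-first offering of rollover data."

embeddings = model.encode([generated, reference], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()  # close to 1.0 = equivalent
print(f"semantic similarity: {score:.2f}")
```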

Please refer to the articles Deepset: Metrics to Evaluate a Question Answering System and Evaluating RAG pipelines with Ragas + LangSmith, which elaborate on these metrics.

Benefits

A few key benefits of synthetic data generation with LLM prompt engineering are:

  • Customized IR Task Query Generation: Prompting LLMs offers great flexibility in the types of queries that can be generated, which is helpful because IR tasks vary by application. For example, Benchmarking-IR (BEIR) is a heterogeneous benchmark containing diverse IR tasks such as question-answering, argument or counter-argument retrieval, fact checking, etc. This diversity is where LLM prompting excels, because the prompt can be tailored to generate synthetic data for the specific IR task. Figure 5 shows an overview of the diverse IR tasks and datasets in BEIR. Refer to the BEIR leaderboard to see the performance of NLP-based retrieval models.

Figure 5: BEIR benchmark datasets and IR tasks. Image taken from [Source].

  • Zero- or Few-Shot Annotations: In a technique referred to as zero- or few-shot prompting, developers can provide domain-specific example queries to LLMs, greatly enhancing query generation. This approach often requires only a handful of annotated samples.
  • Longer Context Length: GPT-style LLMs such as Llama2 provide longer context lengths (up to 4,096 tokens, compared to BERT's 512 tokens). The longer context improves document parsing and gives finer control over query generation; a document-splitting sketch follows this list.
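For example (the splitter choice and chunk sizes are assumptions, not this repository's settings), documents can be split into context chunks that fit comfortably inside Llama2's 4,096-token window while leaving headroom for the prompt template and generated text:

```python
# Sketch: split a long document into context chunks sized for Llama2's window.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # characters per chunk (illustrative, well under 4,096 tokens)
    chunk_overlap=200,  # overlap so facts are not cut in half at chunk boundaries
)
long_document = "..."  # replace with your document text
contexts = splitter.split_text(long_document)
```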

Prompt Templates

Llama2 is used in this repository for generating synthetic queries because it can be run locally on consumer grade GPUs. Shown below is the prompt template for Llama2-Chat, which was fine-tuned for dialogue and instruction-following applications.

<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST]

  • System Prompt: One of the unsung advantages of open-access models is that you have full control over the system prompt (<<SYS>>) in chat applications. This is essential for specifying the behavior of your chat assistant (and even imbuing it with some personality), but it is unreachable in models served behind APIs [Source].
  • User Message: The query or message provided by the user. The [INST] and [/INST] help identify what was typed by the user so Llama knows how to respond properly. Without these markers around the user text, Llama may get confused about whose turn it is to reply.

Note that base Llama2 models have no prompt structure because they are raw non-instruct tuned models [Source].
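Filled in with concrete messages for query generation (the wording is illustrative), the template above can be assembled with a small helper:

```python
# Sketch: assemble a single-turn Llama2-Chat prompt from the template above.
def build_llama2_prompt(system_message: str, user_message: str) -> str:
    return (
        f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_llama2_prompt(
    system_message="You generate one concise question answerable from the passage.",
    user_message="Passage: Pure TalkUSA is a subsidiary of Telrite Corporation.",
)
```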

Additional resources and references to help with prompting techniques and basics:

Issues

This repository will be maintained on a best-effort basis. If you face any issue or want to make improvements, please raise an Issue or submit a Pull Request. 😃

TODOs

  • DeepSpeed ZeRO-Inference: offload massive LLM weights to non-GPU resources to run 70B+ parameter models on consumer grade hardware.
  • Feel free to raise an Issue for a feature you would like to see added.

Liked the work? Please give a star!
