A Configuration-Driven Framework for Reproducible Information Retrieval Experiments

⚠️ This is a work in progress. Everything is subject to change. Suggestions are more than welcome.

What is this repo?

This repo is a boilerplate for applying Configuration-Driven Experimentation (CDE) to Information Retrieval (IR) research. It supports experiments on MS MARCO Passage Ranking v1 and implements the following neural retrieval architectures: BiEncoder, CrossEncoder, and ColBERT.

What is Configuration-Driven Experimentation?

Configuration-Driven Experimentation (CDE) aims to bring the main principles of Configuration-Driven Development (CDD) to Computer Science research experiments. CDD is a software development approach that emphasizes the use of configuration files to define and control the behavior of an application. In CDD, instead of hard-coding specific values or logic directly into the source code, developers rely on external configuration files to specify various aspects of the application's behavior.
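To make the idea concrete, here is a minimal sketch of an experiment whose behavior comes from a configuration document rather than from hard-coded values. It is not taken from this repo: the configuration keys, the values, and the helper names (`load_config`, `describe_experiment`) are assumptions chosen for illustration, and JSON from the Python standard library stands in for whatever configuration format the project actually uses.

```python
import json

# Hypothetical experiment configuration. In practice this text would live in
# an external file; the keys and values here are illustrative assumptions.
CONFIG_TEXT = """
{
  "model": {"architecture": "BiEncoder", "embedding_dim": 768},
  "training": {"batch_size": 32, "learning_rate": 5e-5}
}
"""


def load_config(text):
    """Parse a JSON configuration document into a nested dictionary."""
    return json.loads(text)


def describe_experiment(config):
    """Build a one-line summary using only values read from the config."""
    model = config["model"]
    training = config["training"]
    return (
        f"{model['architecture']} (dim={model['embedding_dim']}), "
        f"batch_size={training['batch_size']}, "
        f"lr={training['learning_rate']}"
    )


config = load_config(CONFIG_TEXT)
summary = describe_experiment(config)
print(summary)
```

Changing the architecture or the batch size then means editing the configuration text, not the Python code that consumes it.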

Why CDE for Information Retrieval?

CDE can be particularly useful for IR experiments for several reasons:

Flexibility in Experiment Setup: IR experiments often involve testing different retrieval models, algorithms, parameters, or data preprocessing techniques. CDD allows researchers to define and modify these experimental configurations without changing the source code directly. This flexibility enables quick iteration and experimentation with various settings.
Reproducibility: Reproducibility is crucial in IR research to validate and compare different approaches. Using configuration files to define the experimental setup, researchers can precisely document the specific configuration used for a particular experiment, making it easier for others to replicate and validate results.
Modularity and Maintainability: IR experiments often involve multiple components, such as indexing, query processing, relevance ranking, and evaluation metrics. CDD allows researchers to modularize these components and configure them separately. Each module can have its own configuration file, making it easier to maintain, update, and reuse components across different experiments.
Customization for Different Scenarios: IR systems may need to be adapted to different domains, data collections, or evaluation scenarios. With CDD, researchers can easily customize the configuration files to adjust the system's behavior and parameters according to specific requirements. This flexibility allows researchers to evaluate and compare different configurations under various scenarios.
Collaboration and Sharing: Configuration files can serve as a common language between researchers, facilitating the collaboration and sharing of experimental setups. Researchers can share their configuration files with others, enabling replication, extension, or modification of experiments while promoting knowledge sharing and fostering a more collaborative research environment.

By adopting Configuration-Driven Development in IR experiments, researchers can streamline the experimentation process, enhance reproducibility, promote collaboration, and facilitate the customization and adaptation of retrieval systems for different scenarios, leading to more robust and reliable research outcomes.
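The modularity and reproducibility points above can be sketched as follows, assuming each component keeps its own configuration fragment. The fragments are merged into one experiment configuration, a variant is derived by overriding a single value, and a hash of the final configuration serves as a compact identifier to record alongside results. The fragment contents and the helper names (`merge`, `fingerprint`) are hypothetical and not part of this repo.

```python
import copy
import hashlib
import json


def merge(base, override):
    """Recursively merge `override` into a copy of `base` (inputs untouched)."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def fingerprint(config):
    """Return a short, deterministic hash of a configuration dictionary."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical per-component configuration fragments.
indexing = {"indexing": {"batch_size": 256}}
ranking = {"ranking": {"architecture": "ColBERT", "top_k": 1000}}
evaluation = {"evaluation": {"metrics": ["mrr@10", "recall@1000"]}}

# Compose the fragments into a single experiment configuration.
base = {}
for module in (indexing, ranking, evaluation):
    base = merge(base, module)

# Derive a variant by overriding one value; everything else is inherited.
variant = merge(base, {"ranking": {"architecture": "CrossEncoder"}})

print(fingerprint(base), fingerprint(variant))
```

Because the fingerprint depends only on the configuration contents, two runs that report the same fingerprint were set up identically, and sharing the configuration is enough for someone else to rebuild the same setup.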

Software stack

Requirements

conda env create -f env.yml

Currently working on the following features:

Knowledge distillation
Validation set performance monitoring
Early stopping
Faiss integration
