GitHub - thongnt99/lsr-long: SIGIR 2023: Adapting Learned Sparse Retrieval to Long Documents

Code for SIGIR 2023 paper: Adapting Learned Sparse Retrieval to Long Documents

Installation

Python packages

conda create --name lsr python=3.9.12
conda activate lsr
pip install -r requirements.txt

Anserini for inverted indexing & retrieval: Clone and compile anserini-lsr, a customized version of Anserini for learned sparse retrieval. When compiling, add -Dmaven.test.skip=true to skip the tests.

Downloading and spliting data

MSMARCO Documents

bash scripts/prepare_msmarco_doc.sh

TREC-Robust04

bash scripts/prepare_robust04.sh

BM25 baselines

bash scripts/run_bm25_msmarco.sh 
bash scripts/run_bm25_robust04.sh

Simple aggregation

To perform aggregation on MSMARCO, follow these steps. For TREC-Robust04, please modify the input and output files accordingly.

1. Running inferences on segments (passages) and queries:

segment inference (can be distributed on multiple gpus to speed up)

for i in {1..60}
do
input_path=data/msmarco_doc/splits_psg/part$(printf "%02d" $i)
output_path=data/msmarco_doc/vectors/part$(printf "%02d" $i)
batch_size=256
type='doc'
python -m lsr.inference --inp $input_path --out $output_path --type $type --bs $batch_size
done

query inference

input_path=data/msmarco_doc/msmarco-docdev-queries.tsv
output_path=data/msmarco_doc/query.tsv
batch_size=256
type='query'
python -m lsr.inference --inp $input_path --out $output_path --type $type --bs $batch_size

2. Aggregating

Representation aggregation

bash scripts/aggregate_rep_msmarco_doc.sh

Score (max) aggregation

bash scripts/aggregate_score_msmarco_doc.sh

ExactSDM and SoftSDM

ExactSDM

Estimating weights/Evaluating on MSMARCO Documents

#Passages	MRR@10	R@1000	Script
1	37.08	95.49	`scripts/train_script_exact_sdm_long_reranker_1_psg.sh`
2	37.45	96.51	`scripts/train_script_exact_sdm_long_reranker_2_psg.sh`
3	37.36	96.76	`scripts/train_script_exact_sdm_long_reranker_3_psg.sh`
4	37.03	96.71	`scripts/train_script_exact_sdm_long_reranker_4_psg.sh`
5	36.95	96.61	`scripts/train_script_exact_sdm_long_reranker_5_psg.sh`

Evaluating on TREC Robust04 (zero-shot)

bash scripts/evaluate_exact_sdm_trec_robust04.sh

SoftSDM

Estimating weights/Evaluating on MSMARCO Documents
Note: using +model.window_sizes=[1,2] +model.proximity=8 generally leads to better performance on MSMARCO document but hurts TREC-Robust04 scores.

#Passages	MRR@10	R@1000	Script
1	36.98	95.49	`scripts/train_script_sdm_long_reranker_1_psg.sh`
2	37.53	96.51	`scripts/train_script_sdm_long_reranker_2_psg.sh`
3	37.41	96.76	`scripts/train_script_sdm_long_reranker_3_psg.sh`
4	36.80	96.71	`scripts/train_script_sdm_long_reranker_4_psg.sh`
5	36.79	96.61	`scripts/train_script_sdm_long_reranker_5_psg.sh`

Evaluating on TREC Robust04 (zero-shot)

bash scripts/evaluate_sdm_trec_robust04.sh

Citing and Authors

If you find this repository helpful, please cite our following papers:

Adapting Learned Sparse Retrieval for Long Documents

@inproceedings{nguyen:sigir2023-llsr,
  author = {Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},
  title = {Adapting Learned Sparse Retrieval for Long Documents},
  booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2023}
}

A Unified Framework for Learned Sparse Retrieval

@inproceedings{nguyen2023unified,
  title={A Unified Framework for Learned Sparse Retrieval},
  author={Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},
  booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part III},
  pages={101--116},
  year={2023},
  organization={Springer}
}

Name		Name	Last commit message	Last commit date
Latest commit History 133 Commits
data/msmarco_doc		data/msmarco_doc
lsr		lsr
run_files_top200		run_files_top200
scripts		scripts
slurm_scripts		slurm_scripts
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data/msmarco_doc

data/msmarco_doc

lsr

lsr

run_files_top200

run_files_top200

scripts

scripts

slurm_scripts

slurm_scripts

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

Installation

Downloading and spliting data

BM25 baselines

Simple aggregation

1. Running inferences on segments (passages) and queries:

2. Aggregating

ExactSDM and SoftSDM

ExactSDM

SoftSDM

Citing and Authors

About

Releases 1

Packages

Languages

thongnt99/lsr-long

Folders and files

Latest commit

History

Repository files navigation

Installation

Downloading and spliting data

BM25 baselines

Simple aggregation

1. Running inferences on segments (passages) and queries:

2. Aggregating

ExactSDM and SoftSDM

ExactSDM

SoftSDM

Citing and Authors

About

Topics

Resources

Stars

Watchers

Forks

Languages