# [NeurIPS 2023] Model-enhanced Vector Index (Paper)

## Environment

[Option 1] Create a conda environment:

```bash
conda env create -f environment.yml
conda activate mevi
```

[Option 2] Use Docker:

```bash
docker pull hugozhl/nci:latest
```

## MSMARCO Passage

### Data Process

[1] Download and preprocess:

```bash
bash dataprocess/msmarco_passage/download_data.sh
python dataprocess/msmarco_passage/prepare_origin.py \
	--data_dir data/marco --origin
```

[2] Tokenize documents:

```bash
# tokenize for T5-ANCE and AR2

# T5-ANCE
python dataprocess/msmarco_passage/prepare_passage_tokenized.py \
	--output_dir data/marco/ance \
	--document_path data/marco/raw/corpus.tsv \
	--dataset marco --model ance
rm data/marco/ance/all_document_indices_*.pkl
rm data/marco/ance/all_document_tokens_*.bin
rm data/marco/ance/all_document_masks_*.bin

# AR2
python dataprocess/msmarco_passage/prepare_passage_tokenized.py \
	--output_dir data/marco/ar2 \
	--document_path data/marco/raw/corpus.tsv \
	--dataset marco --model ar2
rm data/marco/ar2/all_document_indices_*.pkl
rm data/marco/ar2/all_document_tokens_*.bin
rm data/marco/ar2/all_document_masks_*.bin
```

[3] Query generation for augmentation:

We use the same finetuned docT5query checkpoint as NCI; the generated (QG) data is used only for training.

Please download the finetuned docT5query checkpoint to `data/marco/ckpts/doc2query-t5-base-msmarco`.

```bash
# MUST download the finetuned docT5query ckpt before running the scripts
python dataprocess/msmarco_passage/doc2query.py --data_dir data/marco
# optional: if the QG data has bad quality (e.g. empty or heavily
# duplicated queries), additionally run the script below
python dataprocess/msmarco_passage/complement_qg10.py --data_dir data/marco
```
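For intuition, here is a minimal sketch of what docT5query-style augmentation does, assuming the HuggingFace `transformers` API and the checkpoint path above; the repo's `doc2query.py` is the authoritative implementation, and the sampling settings below are illustrative:

```python
# Minimal sketch of docT5query-style augmentation (the repo's doc2query.py is
# authoritative). Sampling settings below are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

ckpt = "data/marco/ckpts/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

passage = "The Manhattan Project produced the first nuclear weapons during World War II."
inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
# Sample several diverse queries per passage to augment the training set.
outputs = model.generate(
    **inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=5
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```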

[4] Generate document embeddings and construct the RQ (residual quantization) structure:

For T5-ANCE, please download the T5-ANCE checkpoint to `data/marco/ckpts/t5-ance`.

For AR2, please download the AR2 checkpoint to `data/marco/ckpts/ar2g_marco_finetune.pkl` and the coCondenser checkpoint to `data/marco/ckpts/co-condenser-marco-retriever`.

```bash
# MUST download the checkpoints before running the scripts
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for AR2
bash MEVI/marco_generate_embedding_n_rq.sh
```
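As background, RQ assigns each document a short sequence of codebook ids by repeatedly clustering the residuals left over from the previous level. A minimal sketch with faiss k-means, where the sizes are illustrative rather than the repo's actual configuration:

```python
# Toy residual quantization (RQ): each level clusters the residuals left by
# the previous level, so a document is identified by a short code sequence.
# d, K, and the level count are illustrative, not the repo's configuration.
import numpy as np
import faiss

d, K, levels = 768, 256, 2
doc_emb = np.random.rand(10000, d).astype("float32")  # stand-in for docemb.bin

residual, codes = doc_emb.copy(), []
for _ in range(levels):
    km = faiss.Kmeans(d, K, niter=20)
    km.train(residual)
    _, assign = km.index.search(residual, 1)    # nearest centroid per doc
    assign = assign.ravel()
    codes.append(assign)
    residual = residual - km.centroids[assign]  # keep only the residual

rq_ids = np.stack(codes, axis=1)  # shape (n_docs, levels): the docid sequence
print(rq_ids[0])                  # e.g. [137  42]
```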

### Training

Train the RQ-based NCI model:

```bash
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for AR2
export WANDB_TOKEN="your wandb token"
bash MEVI/marco_train_nci_rq.sh
```
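Conceptually, the RQ code sequence from step [4] serves as each document's identifier, and the NCI-style seq2seq model is trained to generate that sequence from a query. A toy sketch of a training pair; the joining format is an assumption, and the real tokenization lives in the training code:

```python
# Illustrative only: the joining format for the RQ docid target is an
# assumption; the real tokenization lives in the training code.
def make_target(rq_codes):
    return "-".join(str(c) for c in rq_codes)  # e.g. [137, 42] -> "137-42"

query = "what did the manhattan project produce"
print((query, make_target([137, 42])))         # one (input, target) pair
```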

### Twin-tower Model Evaluation

First, generate query embeddings:

```bash
# for T5-ANCE
python MEVI/generate.py \
	--query_file data/marco/origin/dev_mevi_dedup.tsv \
	--model_path data/marco/ckpts/t5-ance \
	--tokenizer_path data/marco/ckpts/t5-ance \
	--query_embedding_path data/marco/ance/query_emb.bin \
	--gpus 0,1,2,3,4,5,6,7 --gen_query

# for AR2
python MEVI/generate.py \
	--query_file data/marco/origin/dev_mevi_dedup.tsv \
	--model_path data/marco/ckpts/ar2g_marco_finetune.pkl \
	--tokenizer_path bert-base-uncased \
	--query_embedding_path data/marco/ar2/query_emb.bin \
	--gpus 0,1,2,3,4,5,6,7 --gen_query
```

Then use faiss for ANN search:

```bash
# for T5-ANCE; for AR2, change the ance directory to the ar2 directory
python MEVI/faiss_search.py \
	--query_path data/marco/ance/query_emb.bin \
	--doc_path data/marco/ance/docemb.bin \
	--output_path data/marco/ance/hnsw256.txt \
	--raw_query_path data/marco/origin/dev_mevi_dedup.tsv \
	--param HNSW256
```
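Under the hood, `--param HNSW256` suggests a faiss `index_factory` HNSW index. A minimal sketch of the equivalent search, assuming float32 embeddings, a 768-dim encoder, and inner-product similarity (all assumptions; check `faiss_search.py` for the actual metric and file layout):

```python
# Rough equivalent of faiss_search.py with --param HNSW256, assuming float32
# embeddings, 768 dimensions, and inner-product similarity (all assumptions).
import numpy as np
import faiss

d = 768
doc_emb = np.fromfile("data/marco/ance/docemb.bin", dtype="float32").reshape(-1, d)
query_emb = np.fromfile("data/marco/ance/query_emb.bin", dtype="float32").reshape(-1, d)

index = faiss.index_factory(d, "HNSW256", faiss.METRIC_INNER_PRODUCT)
index.hnsw.efSearch = 256  # widen the graph search at query time
index.add(doc_emb)         # build the HNSW graph over documents

scores, doc_ids = index.search(query_emb, 100)  # top-100 candidates per query
```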

### Sequence-to-sequence Model Evaluation

Please download our MSMARCO Passage checkpoint (or train one from scratch) before evaluation and put it in `data/marco/ckpts`. If you use the downloaded checkpoint, please also download the corresponding RQ files.

```bash
# MUST download or train a ckpt before running the scripts
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for AR2
bash MEVI/marco_eval_nci_rq.sh
```

### Ensemble

Ensemble the results from the twin-tower model and the sequence-to-sequence model:

```bash
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for AR2
bash MEVI/marco_ensemble.sh
```
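The fusion rule lives in `marco_ensemble.sh`; as a stand-in for intuition, here is a generic reciprocal rank fusion (RRF) sketch for merging the two ranked lists, which is not necessarily the weighting MEVI uses:

```python
# Generic reciprocal rank fusion (RRF); NOT necessarily the rule used in
# marco_ensemble.sh, just a common way to merge two ranked candidate lists.
def rrf(ranked_lists, k=60, topk=100):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

twin_tower = ["d3", "d7", "d1"]  # e.g. parsed from hnsw256.txt
seq2seq    = ["d7", "d2", "d3"]  # e.g. from the NCI beam-search decode
print(rrf([twin_tower, seq2seq], topk=3))  # -> ['d7', 'd3', 'd2']
```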

## Natural Questions (DPR version)

### Data Process

[1] Download and preprocess:

```bash
bash dataprocess/NQ_dpr/download_data.sh
python dataprocess/NQ_dpr/preprocess.py --data_dir data/nq_dpr
```

[2] Tokenize documents:

```bash
# use AR2
python dataprocess/NQ_dpr/tokenize_passage_ar2.py \
	--output_dir data/nq_dpr \
	--document_path data/nq_dpr/corpus.tsv
rm data/nq_dpr/all_document_indices_*.pkl
rm data/nq_dpr/all_document_tokens_*.bin
rm data/nq_dpr/all_document_masks_*.bin
```

[3] Query generation for augmentation:

We use the same finetuned docT5query checkpoint as NCI; the generated (QG) data is used only for training. Please refer to the QG section for MSMARCO Passage.

```bash
# download the finetuned docT5query ckpt to data/marco/ckpts/doc2query-t5-base-msmarco
python dataprocess/NQ_dpr/doc2query.py \
	--data_dir data/nq_dpr --n_gen_query 1 \
	--ckpt_path data/marco/ckpts/doc2query-t5-base-msmarco
```

[4] Generate document embeddings and construct the RQ structure:

Please download the AR2 checkpoint to `data/marco/ckpts/ar2g_nq_finetune.pkl` and the ERNIE checkpoint to `data/marco/ckpts/ernie-2.0-base-en`.

```bash
# MUST download the checkpoints before running the scripts
bash MEVI/nqdpr_generate_embedding_n_rq.sh
```

[5] Tokenize queries:

Since NQ has a very large number of augmented queries, we pre-tokenize them so that training can read them through memmap, reducing runtime memory usage.

```bash
python dataprocess/NQ_dpr/tokenize_query.py \
	--output_dir data/nq_dpr \
	--tok_train 1 --tok_corpus 1 --tok_qg 1
```
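The point of pre-tokenization is that fixed-width token ids can be mapped lazily from disk. A minimal sketch with `numpy.memmap`, where the file name, dtype, and sequence length are assumptions rather than the repo's exact layout:

```python
# Why pre-tokenization helps: fixed-width token ids can be memory-mapped and
# read lazily from disk. File name, dtype, and max_len are assumptions.
import numpy as np

max_len = 64
tokens = np.memmap("data/nq_dpr/query_tokens.bin", dtype=np.int32,
                   mode="r").reshape(-1, max_len)
print(tokens[12345])  # one query's token ids, loaded on demand
```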

[6] Get answers:

We sort the answers for fast evaluation. (This step is time-consuming; download the processed binary files instead if needed.)

```bash
python dataprocess/NQ_dpr/get_answers.py \
	--data_dir data/nq_dpr \
	--dev 1 --test 1
python dataprocess/NQ_dpr/get_inverse_answers.py \
	--data_dir data/nq_dpr \
	--dev 1 --test 1
```
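A sketch of the speedup that sorting enables: membership checks against a sorted answer list become binary searches rather than linear scans. The data layout here is illustrative only:

```python
# Illustrative speedup from sorting: checking whether a retrieved passage
# carries an answer becomes a binary search. Data layout here is made up.
import bisect

answer_pids = sorted([1045, 20311, 777777])  # answer-bearing passage ids

def is_hit(pid):
    i = bisect.bisect_left(answer_pids, pid)
    return i < len(answer_pids) and answer_pids[i] == pid

retrieved = [777777, 123, 1045]
print(any(is_hit(pid) for pid in retrieved))  # recall@k-style check -> True
```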

### Training

Train the RQ-based NCI model:

```bash
export WANDB_TOKEN="your wandb token"
bash MEVI/nqdpr_train_nci_rq.sh
```

### Twin-tower Model Evaluation

First, generate query embeddings:

```bash
python MEVI/generate.py \
	--query_file data/nq_dpr/nq-test.qa.csv \
	--model_path data/marco/ckpts/ar2g_nq_finetune.pkl \
	--tokenizer_path bert-base-uncased \
	--query_embedding_path data/nq_dpr/query_emb.bin \
	--gpus 0,1,2,3,4,5,6,7 --gen_query
```

Then use faiss for ANN search:

```bash
python MEVI/faiss_search.py \
	--query_path data/nq_dpr/query_emb.bin \
	--doc_path data/nq_dpr/docemb.bin \
	--output_path data/nq_dpr/hnsw256.txt \
	--raw_query_path data/nq_dpr/nq-test.qa.csv \
	--param HNSW256
```

### Sequence-to-sequence Model Evaluation

Please download our NQ checkpoint (or train one from scratch) before evaluation and put it in `data/marco/ckpts`. If you use the downloaded checkpoint, please also download the corresponding RQ files.

```bash
# MUST download or train a ckpt before running the scripts
bash MEVI/nqdpr_eval_nci_rq.sh
```

### Ensemble

Ensemble the results from the twin-tower model and the sequence-to-sequence model:

```bash
bash MEVI/nqdpr_ensemble.sh
```

## Citation

If you find this work useful, please cite our paper.

## Acknowledgement

We learned a lot and borrowed some code from the following projects when building MEVI.
