Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V2

This guide describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections. Details about our model can be found in the following paper:

Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.

And further detailed in:

Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2. Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), July 2022.

Here, we start with versions of the MS MARCO V2 corpora that have already been processed with uniCOIL, i.e., we have applied model inference on every document and stored the output sparse vectors.
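
To make the input format concrete: each file in these pre-processed collections is JSON-lines, where every line is a record whose "vector" field maps terms to quantized impact weights; this is what the JsonVectorCollection indexing jobs below consume. A minimal illustrative record is sketched here (the id, tokens, and weights are invented for illustration, not taken from the corpus):

import json

# Illustrative record only: the id, tokens, and weights are made up.
# Each line of the pre-processed collection is a JSON object whose "vector"
# field maps terms to quantized impact weights; this is what
# JsonVectorCollection indexes with --impact --pretokenized below.
example_record = {
    "id": "msmarco_passage_00_000000000",             # hypothetical passage id
    "contents": "the minimum wage rose in 2021 ...",  # passage (or expanded) text
    "vector": {"minimum": 47, "wage": 51, "rose": 18, "2021": 9},
}
print(json.dumps(example_record, indent=2))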

Quick links:

- Passage Ranking (No Expansion)
- Passage Ranking (With doc2query-T5 Expansion)
- Document Ranking (No Expansion)
- Document Ranking (With doc2query-T5 Expansion)

Passage Ranking (No Expansion)

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on the V2 data, nor to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using the model trained on the MS MARCO V1 passage corpus.

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V2 passage. The passage ranking experiments here correspond to row (3a) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar -P collections/
tar -xvf collections/msmarco_v2_passage_unicoil_noexp_0shot.tar -C collections/

To confirm, msmarco_v2_passage_unicoil_noexp_0shot.tar is 24 GB and has an MD5 checksum of d9cc1ed3049746e68a2c91bf90e5212d.
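
To verify a download before untarring, a quick check along these lines should work; the path below matches the wget command above, and the same approach applies to the other tarballs in this guide with their respective checksums:

import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a (potentially large) file in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Should print d9cc1ed3049746e68a2c91bf90e5212d for this tarball.
print(md5sum("collections/msmarco_v2_passage_unicoil_noexp_0shot.tar"))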

To index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_passage_unicoil_noexp_0shot/ \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

To perform retrieval:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
  --topics msmarco-v2-passage-dev \
  --encoder castorini/unicoil-noexp-msmarco-passage \
  --output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt \
  --batch 144 --threads 36 \
  --hits 1000 \
  --impact
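
For interactive exploration, retrieval with on-the-fly uniCOIL query encoding can also be driven from the Python API. The sketch below assumes the LuceneImpactSearcher interface of recent Pyserini releases (the query is arbitrary); the CLI invocation above remains the authoritative way to generate the run file:

from pyserini.search.lucene import LuceneImpactSearcher

# Sketch only: assumes LuceneImpactSearcher(index_dir, query_encoder) as in
# recent Pyserini releases; the encoder name matches --encoder above.
searcher = LuceneImpactSearcher(
    "indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/",
    "castorini/unicoil-noexp-msmarco-passage",
)
hits = searcher.search("how long is the great wall of china", k=10)
for i, hit in enumerate(hits):
    print(f"{i + 1:2} {hit.docid:30} {hit.score:.4f}")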

To evaluate using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt

Results:
map                   	all	0.1334
recip_rank            	all	0.1343

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt

Results:
recall_100            	all	0.4983
recall_1000           	all	0.7010

Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-passage-dev-unicoil-noexp.

Passage Ranking (With doc2query-T5 Expansion)

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V2 passage. The passage ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar -P collections/
tar -xvf collections/msmarco_v2_passage_unicoil_0shot.tar -C collections/

To confirm, msmarco_v2_passage_unicoil_0shot.tar is 41 GB and has an MD5 checksum of 1949a00bfd5e1f1a230a04bbc1f01539.

To index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_passage_unicoil_0shot/ \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

To perform retrieval:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
  --topics msmarco-v2-passage-dev \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt \
  --batch 144 --threads 36 \
  --hits 1000 \
  --impact

To evaluate using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt

Results:
map                     all     0.1488
recip_rank              all     0.1501

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt

Results:
recall_100              all     0.5515
recall_1000             all     0.7613

Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-passage-dev-unicoil.

Document Ranking (No Expansion)

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on the V2 data, nor to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using the model trained on the MS MARCO V1 passage corpus. When performing inference on the documents with the uniCOIL model, we prepended the document title to provide context; in our experiments, this was more effective than omitting the title (a condition we also tried).

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V2 doc. The document ranking experiments here correspond to row (3a) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar -P collections/
tar -xvf collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar -C collections/

To confirm, msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar is 55 GB and has an MD5 checksum of 97ba262c497164de1054f357caea0c63.

To index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2/ \
  --index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

To perform retrieval:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2/ \
  --topics msmarco-v2-doc-dev \
  --encoder castorini/unicoil-noexp-msmarco-passage \
  --output runs/run.msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2.dev.txt \
  --batch 144 --threads 36 \
  --hits 10000 --max-passage --max-passage-hits 1000 \
  --impact

For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
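
Concretely, segment ids in the segmented corpus take the form <docid>#<segment index>, and MaxP scores each document by its best-scoring segment. The following is an illustrative sketch of that aggregation (with invented scores), not Pyserini's internal implementation; the --max-passage and --max-passage-hits options above handle this for you:

from collections import defaultdict

def maxp(segment_hits, k=1000):
    """segment_hits: iterable of (segment_id, score) pairs; returns top-k (docid, score)."""
    best = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_hits:
        docid = seg_id.split("#")[0]           # strip the "#<segment>" suffix
        best[docid] = max(best[docid], score)  # keep the best-scoring segment
    return sorted(best.items(), key=lambda x: x[1], reverse=True)[:k]

# Example with invented scores:
print(maxp([("msmarco_doc_00_0#0", 12.3),
            ("msmarco_doc_00_0#1", 15.1),
            ("msmarco_doc_00_1#0", 14.2)], k=2))
# -> [('msmarco_doc_00_0', 15.1), ('msmarco_doc_00_1', 14.2)]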

To evaluate using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
    runs/run.msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2.dev.txt

Results:
map                   	all	0.2206
recip_rank            	all	0.2232

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
    runs/run.msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2.dev.txt

Results:
recall_100            	all	0.7460
recall_1000           	all	0.8987

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil-noexp.

Document Ranking (With doc2query-T5 Expansion)

After the TREC 2021 Deep Learning Track submissions, we were able to complete the doc2query-T5 expansions. When performing inference on the documents with the uniCOIL model, we prepended the document title to provide context; in our experiments, this was more effective than omitting the title (a condition we also tried).

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V2 doc. The document ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -P collections/
tar -xvf collections/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -C collections/

To confirm, msmarco_v2_doc_segmented_unicoil_0shot_v2.tar is 72 GB and has an MD5 checksum of c5639748c2cbad0152e10b0ebde3b804.

To index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_doc_segmented_unicoil_0shot_v2/ \
  --index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-0shot-v2/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

To perform retrieval:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-doc-segmented-unicoil-0shot-v2/ \
  --topics msmarco-v2-doc-dev \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-v2-doc-segmented-unicoil-0shot-v2.dev.txt \
  --batch 144 --threads 36 \
  --hits 10000 --max-passage --max-passage-hits 1000 \
  --impact

For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.

To evaluate using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
    runs/run.msmarco-v2-doc-segmented-unicoil-0shot-v2.dev.txt

Results:
map                     all     0.2388
recip_rank              all     0.2419

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
    runs/run.msmarco-v2-doc-segmented-unicoil-0shot-v2.dev.txt

Results:
recall_100              all     0.7789
recall_1000             all     0.9120

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil.

Reproduction Log*