
Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V1

This guide describes how to reproduce the uniCOIL experiments in the following paper:

Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.

And further detailed in:

Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2. Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), July 2022.

Here, we start with versions of the MS MARCO V1 corpora that have already been processed with uniCOIL, i.e., we have applied model inference on every document and stored the output sparse vectors.

Quick Links: Passage Ranking, Document Ranking

Passage Ranking

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V1 passage. The passage ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

Corpus Download

We're going to use the Pyserini repository's root directory as the working directory. First, we need to download and unpack the corpus:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar -P collections/
tar xvf collections/msmarco-passage-unicoil.tar -C collections/

To confirm, msmarco-passage-unicoil.tar is 3.4 GB and has MD5 checksum 78eef752c78c8691f7d61600ceed306f.
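To double-check the download, here is a minimal Python sketch using only the standard library (the expected hash is the one quoted above):

import hashlib

# Compute the MD5 of the downloaded tarball in 1 MB chunks.
h = hashlib.md5()
with open('collections/msmarco-passage-unicoil.tar', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        h.update(chunk)

print(h.hexdigest())  # should print 78eef752c78c8691f7d61600ceed306f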

Indexing

We can now index these docs:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco-passage-unicoil/ \
  --index indexes/lucene-index.msmarco-passage-unicoil/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized

The important indexing options to note here are --impact --pretokenized: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (the default behavior), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
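For reference, each line in the corpus is a JSON document in Anserini's JsonVectorCollection layout, where the vector field maps pre-tokenized terms to quantized impact weights. The sketch below constructs a toy entry; the id, terms, and weights are invented for illustration:

import json

# A toy JsonVectorCollection entry; "vector" holds term -> impact weight.
doc = {
    'id': '0',
    'contents': '',
    'vector': {'cities': 45, 'big': 31, 'population': 27},
}
print(json.dumps(doc))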

Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.
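One way to sanity-check the document count is through Pyserini's index reader; a minimal sketch, assuming the IndexReader API of recent releases (newer versions may name this class LuceneIndexReader):

from pyserini.index.lucene import IndexReader

# stats() reports document count, unique terms, and total term frequency.
index_reader = IndexReader('indexes/lucene-index.msmarco-passage-unicoil/')
print(index_reader.stats())  # expect 'documents': 8841823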

Retrieval

We can now run retrieval using the castorini/unicoil-msmarco-passage model available on Hugging Face's model hub to encode the queries:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-passage-unicoil/ \
  --topics msmarco-passage-dev-subset \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-passage.unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 \
  --impact

Here, we are using the transformer model to encode the queries on the fly, on the CPU. The important option here is --impact, which specifies impact scoring. With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that, we're adding neural inference on the CPU. A complete run typically takes around 30 minutes.
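The same retrieval is available interactively through Pyserini's LuceneImpactSearcher, which mirrors the command above; a minimal sketch (the query string is an arbitrary example):

from pyserini.search.lucene import LuceneImpactSearcher

# The searcher encodes queries with uniCOIL on the fly and scores postings
# using the quantized impact weights stored in the index.
searcher = LuceneImpactSearcher('indexes/lucene-index.msmarco-passage-unicoil/',
                                'castorini/unicoil-msmarco-passage')
hits = searcher.search('what is a lobster roll?', k=10)

for i, hit in enumerate(hits[:3]):
    print(f'{i + 1:2} {hit.docid:7} {hit.score:.5f}')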

The output is in MS MARCO output format, so we can directly evaluate:

$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil.tsv

#####################
MRR @10: 0.3508734138354477
QueriesRanked: 6980
#####################

There might be small differences in score due to non-determinism in neural inference; see these notes for details. The above score was obtained on Linux.

Alternatively, we can use pre-tokenized queries with pre-computed weights, which are already included in Pyserini. We can run retrieval as follows:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-passage-unicoil/ \
  --topics msmarco-passage-dev-subset-unicoil \
  --output runs/run.msmarco-passage.unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 \
  --impact

Here, we also specify --impact for impact scoring. Since we're not applying neural inference to the queries, retrieval is much faster; a complete run typically takes less than 10 minutes.

The output is in MS MARCO output format, so we can directly evaluate:

$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil.tsv

#####################
MRR @10: 0.35155222404147896
QueriesRanked: 6980
#####################

Note that in this case, the results should be deterministic.
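To inspect what these pre-encoded queries look like, here is a small sketch using Pyserini's topic loader; we assume the topic key matches the --topics name used above:

from pyserini.search import get_topics

# Load the pre-tokenized, pre-weighted uniCOIL queries and print one.
topics = get_topics('msmarco-passage-dev-subset-unicoil')
qid = next(iter(topics))
print(qid, topics[qid])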

Document Ranking

To reproduce these runs directly from our pre-built indexes, see our two-click reproduction matrix for MS MARCO V1 doc. The document ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference (although see note below for more details).

Corpus Download

We're going to use the Pyserini repository's root directory as the working directory. First, we need to download and unpack the corpus:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/

To confirm, msmarco-doc-segmented-unicoil.tar is 19 GB and has MD5 checksum 6a00e2c0c375cb1e52c83ae5ac377ebb.

Indexing

We can now index these docs:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco-doc-segmented-unicoil/ \
  --index indexes/lucene-index.msmarco-doc-segmented-unicoil/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized

The important indexing options to note here are --impact --pretokenized: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (the default behavior), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.

The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around an hour.

Retrieval

We can now run retrieval:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-doc-segmented-unicoil \
  --topics msmarco-doc-dev \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-doc-segmented-unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 --max-passage --max-passage-hits 100 \
  --impact

Here, we are using the transformer model to encode the queries on the fly, on the CPU. The important option here is --impact, which specifies impact scoring. With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that, we're adding neural inference on the CPU. A complete run can take around 40 minutes.
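Since the index stores segments rather than full documents, --max-passage collapses segment-level hits into document scores via MaxP. A minimal sketch of that aggregation, assuming segment ids of the form docid#segment (the scores below are made up):

# Collapse segment-level hits to document-level scores by keeping, for each
# document, its best-scoring segment (MaxP), then truncate to the run depth.
def maxp(segment_hits, depth=100):
    best = {}
    for seg_id, score in segment_hits:
        docid = seg_id.split('#')[0]
        if docid not in best or score > best[docid]:
            best[docid] = score
    return sorted(best.items(), key=lambda kv: -kv[1])[:depth]

print(maxp([('D1#0', 11.2), ('D1#3', 14.9), ('D2#1', 13.7)], depth=2))
# [('D1', 14.9), ('D2', 13.7)]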

The output is in MS MARCO output format, so we can directly evaluate:

$ python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev \
    --run runs/run.msmarco-doc-segmented-unicoil.tsv

#####################
MRR @100: 0.3530641289682811
QueriesRanked: 5193
#####################

There might be small differences in score due to non-determinism in neural inference; see these notes for details. The above score was obtained on Linux.

Alternatively, we can use pre-tokenized queries with pre-computed weights, which are already included in Pyserini. We can run retrieval as follows:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-doc-segmented-unicoil \
  --topics msmarco-doc-dev-unicoil \
  --output runs/run.msmarco-doc-segmented-unicoil.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 --max-passage --max-passage-hits 100 \
  --impact

Here, we also specify --impact for impact scoring. Since we're not applying neural inference to the queries, retrieval is much faster; a complete run typically takes less than 10 minutes.

The output is in MS MARCO output format, so we can directly evaluate:

$ python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev \
    --run runs/run.msmarco-doc-segmented-unicoil.tsv

#####################
MRR @100: 0.352997702662614
QueriesRanked: 5193
#####################

Note that in this case, the results should be deterministic.

A final detail: with MaxP and the need to generate runs to different depths, --hits (the segment-level retrieval depth) and --max-passage-hits (the document-level run depth) can be set independently. Due to tie-breaking effects, different settings yield slightly different results; see the Anserini experiments for additional details. Because of these slightly different parameter settings, the results here do not exactly match those in the two-click reproduction matrix for MS MARCO V1 doc.

Reproduction Log*