
Pyserini: BM25 Baseline for MS MARCO Passage Ranking

This guide contains instructions for running a BM25 baseline on the MS MARCO passage ranking task, which is nearly identical to the corresponding guide in Anserini, except that everything here is in Python (no Java). Note that there is a separate guide for the MS MARCO document ranking task. This exercise will require a machine with >8 GB RAM and >15 GB free disk space.

If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've already done the BM25 Baselines for MS MARCO Passage Ranking in Anserini. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.

Learning outcomes for this guide, building on previous steps in the onboarding path:

  • Be able to use Pyserini to build a Lucene inverted index on the MS MARCO passage collection.
  • Be able to use Pyserini to perform a batch retrieval run on the MS MARCO passage collection with the dev queries.
  • Be able to evaluate the retrieved results above.
  • Be able to generate the retrieved results above interactively by directly manipulating Pyserini Python classes.

In short, you'll do everything you did with Anserini (in Java) on the MS MARCO passage ranking test collection, but now with Pyserini (in Python).

What's Pyserini? Well, it's the repo that you're in right now. Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. The toolkit provides Python bindings for our group's Anserini IR toolkit, which is built on Lucene (in Java). Pyserini provides entrée into the broader deep learning ecosystem, which is heavily Python-centric.

Data Prep

This guide requires the development installation, so get your Python environment set up first.

Once you've done that: congratulations, you've passed the most difficult part! Everything else below mirrors what you did in Anserini (in Java), so it should be easy.

We're going to use collections/msmarco-passage/ as the working directory. First, we need to download and extract the MS MARCO passage dataset:

mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, collectionandqueries.tar.gz should have an MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
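
If you like, you can verify the checksum programmatically as well. Here's a minimal Python sketch (md5sum on the command line works equally well); the path below assumes the download location above:

import hashlib

md5 = hashlib.md5()
with open('collections/msmarco-passage/collectionandqueries.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        md5.update(chunk)

print(md5.hexdigest())  # should print 31644046b18952c1386cd4564ba2ae69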

Next, we need to convert the MS MARCO tsv collection into Pyserini's jsonl files (which have one json object per line):

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
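
As a quick sanity check, we can open one of the shards and inspect the first record. This is a minimal sketch; it assumes the docs00.json shard name produced by the conversion script:

import json

with open('collections/msmarco-passage/collection_jsonl/docs00.json') as f:
    doc = json.loads(f.readline())

print(doc['id'], doc['contents'][:70])  # each record has 'id' and 'contents' fields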

Indexing

We can now index these documents as a JsonCollection using Pyserini:

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input collections/msmarco-passage/collection_jsonl \
  --index indexes/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw

The command-line invocation should look familiar: it essentially mirrors the command with Anserini (in Java). If you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.

Upon completion, you should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
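
You can verify the document count directly from Python. A minimal sketch, assuming the IndexReader class in pyserini.index.lucene (newer releases also expose it as LuceneIndexReader):

from pyserini.index.lucene import IndexReader

index_reader = IndexReader('indexes/lucene-index-msmarco-passage')
print(index_reader.stats())  # the 'documents' entry should read 8841823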

Retrieval

The 6980 queries in the development set are already stored in the repo. Let's take a peek:

$ head tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number

$ wc tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
    6980   48335  290193 tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt

Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs by name, as the short sketch below shows.
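
Here, get_topics returns a dictionary mapping query ids to query metadata (a minimal sketch for illustration):

from pyserini.search import get_topics

topics = get_topics('msmarco-passage-dev-subset')
print(len(topics))               # 6980
print(topics[1048585]['title'])  # what is paula deen's brother

We can now perform retrieval using these queries: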

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 4 --batch-size 16

Here, we set the BM25 parameters to k1=0.82, b=0.68 (tuned by grid search). The option --output-format msmarco says to generate output in the MS MARCO output format; the option --hits specifies the number of documents to return per query. Thus, the output file should have approximately 6980 × 1000 = 6.9M lines (approximate because a few queries may retrieve fewer than 1000 passages).
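
Concretely, each line of an MS MARCO-format run is a tab-separated (query id, document id, rank) triple. Given the results we'll see below for qid 1048585, the run file for that query starts like this:

1048585	7187158	1
1048585	7187157	2
1048585	7187163	3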

Once again, if you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.

Retrieval speed will vary by hardware: on a reasonably modern CPU with an SSD, we might get around 13 qps (queries per second), so a single-threaded run would finish in under ten minutes. The --threads and --batch-size arguments (already set above) enable multi-threaded retrieval; for example, with --threads 16 --batch-size 64 on a CPU with sufficient cores, the entire run will finish in a couple of minutes.

Evaluation

After the run finishes, we can evaluate the results using the official MS MARCO evaluation script, which has been incorporated into Pyserini:

$ python -m pyserini.eval.msmarco_passage_eval \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
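
If you want to see what MRR@10 actually computes, here's a minimal sketch that recovers the score from the run file and the qrels: for each query, take the reciprocal of the rank of the first relevant passage within the top 10 (zero if there is none), then average over all ranked queries. The official script is the authoritative implementation; this is for illustration only:

from collections import defaultdict

# Relevant docids per query, from the tsv qrels (qid, 0, docid, label).
qrels = defaultdict(set)
with open('collections/msmarco-passage/qrels.dev.small.tsv') as f:
    for line in f:
        qid, _, docid, label = line.split()
        if int(label) > 0:
            qrels[qid].add(docid)

# Reciprocal rank of the first relevant hit in the top 10, per query.
rr = {}
with open('runs/run.msmarco-passage.bm25tuned.txt') as f:
    for line in f:
        qid, docid, rank = line.split()
        rr.setdefault(qid, 0.0)
        if rr[qid] == 0.0 and int(rank) <= 10 and docid in qrels[qid]:
            rr[qid] = 1.0 / int(rank)

print(sum(rr.values()) / len(rr))  # should match the MRR@10 above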

We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10.

The tool needs a different run format, so it's easier to just run retrieval again:

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.trec \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 4 --batch-size 16

The only difference here is that we've removed --output-format msmarco, so the run is written in the standard TREC format: each line holds (query id, Q0, docid, rank, score, run tag), as we'll see below.

Then, convert the qrels file into TREC format:

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
  --input collections/msmarco-passage/qrels.dev.small.tsv \
  --output collections/msmarco-passage/qrels.dev.small.trec
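
Both qrels formats carry the same four fields (query id, iteration, docid, relevance label); the conversion essentially rewrites the tab-delimited tsv lines as space-delimited TREC lines. For the Paula Deen query we'll examine below, the judgment looks like this:

1048585 0 7187158 1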

Finally, run the trec_eval tool, which has been incorporated into Pyserini:

$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap \
   collections/msmarco-passage/qrels.dev.small.trec \
   runs/run.msmarco-passage.bm25tuned.trec

map                   	all	0.1957
recall_1000           	all	0.8573

If you want to examine the MRR@10 for qid 1048585:

$ python -m pyserini.eval.trec_eval -q -c -M 10 -m recip_rank \
    collections/msmarco-passage/qrels.dev.small.trec \
    runs/run.msmarco-passage.bm25tuned.trec | grep 1048585

recip_rank            	1048585	1.0000

Once again, if you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.

Otherwise, congratulations! You've done everything that you did in Anserini (in Java), but now in Pyserini (in Python).

Interactive Retrieval

There's one final thing we should go over. Because we're in Python now, we get the benefit of having an interactive shell. Thus, we can run Pyserini interactively.

Try the following:

from pyserini.search.lucene import LuceneSearcher

# Open the index we built above and set the same tuned BM25 parameters.
searcher = LuceneSearcher('indexes/lucene-index-msmarco-passage')
searcher.set_bm25(0.82, 0.68)
hits = searcher.search('what is paula deen\'s brother')

# Print the top 10 hits: rank, docid, score.
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')

The LuceneSearcher class provides search capabilities for BM25. In the snippet above, we're issuing the query about Paula Deen's brother from the dev set. Note that we're explicitly setting the BM25 parameters, which are not the defaults. We get back a list of results (hits), which we then iterate through and print out:

 1 7187158 18.811600
 2 7187157 18.333401
 3 7187163 17.878799
 4 7546327 16.962099
 5 7187160 16.564699
 6 8227279 16.432501
 7 7617404 16.239901
 8 7187156 16.024900
 9 2298838 15.701500
10 7187155 15.513300

You can confirm that these results are identical to those from the batch run with pyserini.search.lucene above:

$ grep 1048585 runs/run.msmarco-passage.bm25tuned.trec | head -10
1048585 Q0 7187158 1 18.811600 Anserini
1048585 Q0 7187157 2 18.333401 Anserini
1048585 Q0 7187163 3 17.878799 Anserini
1048585 Q0 7546327 4 16.962099 Anserini
1048585 Q0 7187160 5 16.564699 Anserini
1048585 Q0 8227279 6 16.432501 Anserini
1048585 Q0 7617404 7 16.239901 Anserini
1048585 Q0 7187156 8 16.024900 Anserini
1048585 Q0 2298838 9 15.701500 Anserini
1048585 Q0 7187155 10 15.513300 Anserini

To pull up the actual contents of a hit:

hits[0].lucene_document.get('raw')

And you should get:

'{\n  "id" : "7187158",\n  "contents" : "Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦"\n}'

Everything make sense? If so, now you're truly done with this guide and are ready to move on and learn about the relationship between sparse and dense retrieval!

Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.

Reproduction Log*