
Ag-valuate - A Test Collection for Agricultural Information Needs

Table of Contents

  1. Overview
  2. Documents and Passages
  3. Question/Query Topics
  4. Pooling and Relevance Assessment
  5. Passage Retrieval Baselines

Overview

Ag-valuate is a new test collection for both passage and document retrieval in the agriculture domain.

Two sources of agricultural information were obtained as part of the collection: 4,003 agricultural reports from the Grains Research and Development Corporation (GRDC) and State Departments of Agriculture in Australia; and 82,843 scientific journal and conference articles from 33 agricultural journals.

These reports and journal articles were selected as relevant to the grains industry, with a focus on crop agronomy and soils. The targeted subject matter related to the growth and management of grain crops, including cereals (e.g. wheat, barley, and sorghum), legumes (e.g. chickpea, soybean, mungbean), and oilseeds (e.g. canola), and the management of the soils on which these crops are grown. Topics covered included recommendations and research relevant to the management of individual crops through varietal selection, sowing times, planting rates, and row spacing; whole farming system performance, crop sequencing and fallow management practices; fertiliser management; and the identification and management of pests and diseases that affect the grains industry. Both sources came in the form of PDF documents.

Ag-valuate provides a rich resource with a wide variety of uses: passage or document retrieval, query variation, answer generation, scientific document extraction, and domain-specific or expert search. To demonstrate the utility of Ag-valuate, we conducted experiments for two of these tasks, passage retrieval and query variation, using state-of-the-art neural rankers, reporting their effectiveness and providing the code alongside the collection.

Documents and Passages

The pre-processed GRDC JSON reports are provided freely in the repository. The raw PDF files of the GRDC reports can be downloaded from here. The journal articles come from subscription journals and so cannot be redistributed. However, we provide crawler scripts that can be used to download the full collection, using a public API for the reports and an institutional or paid subscription for the journals. The Document Crawler includes details of how to run the following document crawlers (a minimal download sketch is shown after the list):

  1. GRDC Reports
  2. Elsevier
  3. MDPI
  4. Springer
  5. Wiley
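
The snippet below is a minimal, illustrative download sketch. It assumes a plain-text file of PDF URLs ("report_urls.txt") and an output directory, both hypothetical; the actual crawlers in the repository handle each publisher's API and authentication.

```python
# Minimal sketch: fetch report PDFs from a list of URLs.
# "report_urls.txt" and the output directory are hypothetical placeholders;
# the repository's crawlers implement the real per-publisher logic.
import pathlib
import requests

out_dir = pathlib.Path("pdfs")
out_dir.mkdir(exist_ok=True)

with open("report_urls.txt") as f:           # one PDF URL per line (hypothetical input)
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    name = url.rstrip("/").split("/")[-1] or "report.pdf"
    target = out_dir / name
    if target.exists():                      # skip files already fetched
        continue
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    target.write_bytes(resp.content)
```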

Once full-text PDFs were obtained, they were converted from PDF to JSON using Apache Tika. From there, the documents were further split into passages of three sentences each (the spaCy sentencizer was used to derive sentence boundaries; see [Code]). From the 86,846 documents, 9,441,693 passages were produced.
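
As an illustration of the passage-splitting step, the sketch below uses the spaCy sentencizer to break a document's text into three-sentence passages. The JSON field names ("id", "text", "contents") are assumptions for the example; the exact processing is in the linked [Code].

```python
# Minimal sketch: split a document's text into three-sentence passages
# using the spaCy sentencizer. Field names are illustrative assumptions.
import json
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")                  # rule-based sentence boundaries

def to_passages(doc_id: str, text: str, size: int = 3):
    sents = [s.text.strip() for s in nlp(text).sents]
    for i in range(0, len(sents), size):
        yield {"id": f"{doc_id}-p{i // size}",
               "contents": " ".join(sents[i:i + size])}

with open("document.json") as f:             # hypothetical Tika-converted document
    doc = json.load(f)
passages = list(to_passages(doc["id"], doc["text"]))
```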

Question/Query Topics

A total of 210 topics were created from 165 documents (multiple, different topics could sometimes be derived from a single document). Topics were divided into training and test sets: the 50 topics with the most relevance assessments formed the test set and the remaining 160 topics formed the training set. (Other splits can be made as desired; ours was done purely for our experiments.) Each topic contains multiple query variations, a natural language question, and an expert-authored answer, providing a rich representation of the information need. Relevance assessment by two agricultural experts produced 3,948 judged question-passage pairs.
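
As a sketch of how such a split can be derived, the snippet below counts relevance assessments per topic from a TREC-style qrels file and takes the 50 most-assessed topics as the test set. The file names and field names are assumptions, not the collection's exact layout.

```python
# Minimal sketch: split topics into test (50 most-assessed) and training sets.
# "topics.json" and "qrels.txt" are hypothetical file names.
import json
from collections import Counter

with open("topics.json") as f:               # each topic: id, question, queries, answer
    topics = json.load(f)

judgements_per_topic = Counter()
with open("qrels.txt") as f:                 # TREC format: topic_id Q0 passage_id grade
    for line in f:
        judgements_per_topic[line.split()[0]] += 1

ranked = sorted(topics, key=lambda t: judgements_per_topic[t["id"]], reverse=True)
test_topics, train_topics = ranked[:50], ranked[50:]
```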


Pooling and Relevance Assessment

Using the 210 topics, we set out to form a high-quality pool for relevance assessment. We considered two state-of-the-art neural ranking systems:

  1. monoBERT Reranker, a first-stage BM25 retrieval of 1000 documents followed by a fine-tuned monoBERT reranker, without interpolating the reranker score with BM25. We used a monoBERT model pre-trained on the MSMARCO dataset and then fine-tuned on the 160 training topics. (A minimal sketch of this pipeline is shown after the list.)
  2. TILDEv2 Tuned, a neural reranker that utilises document expansion at indexing time to avoid neural encoding of the query or document at query time. It involved a first-stage BM25 retrieval of 1000 documents, followed by a fine-tuned TILDEv2 reranker. TILDEv2 was added as a computationally efficient, yet still effective, model that might be deployed in a live search system. This model was also fine-tuned on the 160 training topics.
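
Below is a minimal sketch of the BM25-then-monoBERT pipeline from item 1, using Pyserini for first-stage retrieval and a Hugging Face monoBERT checkpoint for reranking. The index path and the checkpoint name are assumptions; the fine-tuned models used for pooling are not reproduced here.

```python
# Minimal sketch: BM25 first-stage retrieval followed by monoBERT reranking.
# The index path and model checkpoint below are assumptions for illustration.
import torch
from pyserini.search.lucene import LuceneSearcher
from transformers import AutoModelForSequenceClassification, AutoTokenizer

searcher = LuceneSearcher("indexes/agvaluate-passages")        # hypothetical index
tokenizer = AutoTokenizer.from_pretrained("castorini/monobert-large-msmarco")
model = AutoModelForSequenceClassification.from_pretrained(
    "castorini/monobert-large-msmarco").eval()

def bm25_then_monobert(question: str, k: int = 1000):
    hits = searcher.search(question, k=k)                      # first-stage BM25
    reranked = []
    for hit in hits:
        passage = searcher.doc(hit.docid).contents()           # stored passage text
        inputs = tokenizer(question, passage, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        score = torch.softmax(logits, dim=-1)[0, 1].item()     # P(relevant)
        reranked.append((hit.docid, score))
    return sorted(reranked, key=lambda x: x[1], reverse=True)
```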

Runs for all 210 topics were produced for each of the two systems above. These runs were fused using reciprocal rank fusion to produce the final pool for human assessment. [Code]
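
Reciprocal rank fusion itself is simple to sketch: each passage receives a score of 1/(k + rank) from every run it appears in, and passages are re-ranked by the summed score. The snippet below uses the commonly chosen k = 60, which is an assumption rather than necessarily the value used to build the pool; the linked [Code] contains the actual fusion script.

```python
# Minimal sketch of reciprocal rank fusion (RRF) over the two system runs.
from collections import defaultdict

def rrf(runs, k: int = 60):
    fused = defaultdict(float)
    for ranking in runs:                      # one ranked list of passage ids per system
        for rank, passage_id in enumerate(ranking, start=1):
            fused[passage_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Example: fuse the monoBERT and TILDEv2 rankings for one topic.
pool = rrf([["p3", "p7", "p1"], ["p7", "p2", "p3"]])
```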

Relevance assessment was conducted by authors D. Lawrence and Y. Dang, both agricultural scientists. Each was presented with the topic question and a list of passages for judging, along with a link to the PDF source document from which each passage was extracted. Grades of relevance were: relevant, marginal, and non-relevant. The criterion given to assessors was whether the passage helps to answer the question: relevant meant the passage contained the answer, marginal meant the passage contained some part but not the whole answer, and non-relevant meant the passage contained no useful information.

For the topics in the test set, assessors judged passages in rank order down to rank 20; if no relevant passage was found in the top 20, they continued down the ranking until a relevant passage was found or rank 100 was reached.

For the topics in the training set, assessors judged the top 10 passages, regardless of relevance. Topics obtained via the known-item retrieval process have at least one relevant passage.


Passage Retrieval Baselines

We implemented the following retrieval models and evaluated them on the Ag-valuate test collection:

  1. BM25, a vanilla BM25 baseline to understand how simple term-based retrieval performs.
  2. BM25-Tuned-RM3, BM25 with parameters b and k1 tuned on the training set, plus pseudo-relevance feedback using RM3.
  3. monoBERT Reranker, BM25 followed by a monoBERT reranker pre-trained on MSMARCO and fine-tuned on the 160 training topics (the same system used for pooling).
  4. TILDEv2 Tuned, the same computationally efficient neural document expansion model used for pooling.
  5. TILDEv2, TILDEv2 without fine-tuning on the target domain, providing an estimate of the benefit of fine-tuning.
  6. ANCE, a dense retriever that selects more realistic negative training instances from an Approximate Nearest Neighbor (ANN) index of the corpus. We used an ANCE model pre-trained on the MSMARCO dataset.

To make use of the multi-faceted topics provided by Ag-valuate, we ran the above models using both the natural language question and the keyword query versions of each topic. This aimed to uncover insights into how query variation impacts effectiveness.
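
As a sketch of how runs can be scored against the collection's judgements, the snippet below uses pytrec_eval with TREC-format qrels and run files. The file names and the measure set are illustrative assumptions; the evaluation setup reported with the collection may differ.

```python
# Minimal sketch: evaluate a run against the qrels with pytrec_eval.
# "qrels.txt" and "run.txt" are hypothetical TREC-format files.
import pytrec_eval

def read_qrels(path):
    qrels = {}
    with open(path) as f:                     # TREC format: topic Q0 passage grade
        for line in f:
            topic, _, passage, grade = line.split()
            qrels.setdefault(topic, {})[passage] = int(grade)
    return qrels

def read_run(path):
    run = {}
    with open(path) as f:                     # TREC format: topic Q0 passage rank score tag
        for line in f:
            topic, _, passage, _, score, _ = line.split()
            run.setdefault(topic, {})[passage] = float(score)
    return run

evaluator = pytrec_eval.RelevanceEvaluator(read_qrels("qrels.txt"), {"map", "ndcg_cut"})
results = evaluator.evaluate(read_run("run.txt"))
mean_ndcg10 = sum(r["ndcg_cut_10"] for r in results.values()) / len(results)
```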



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
