fair-trec/fair-trec-tools

Fair TREC Tools

Public tools for working with the Fair TREC data.

Environment & Compilation

The provided Conda environment spec will install all required dependencies, except for the AWS CLI tools needed for downloading the Open Corpus:

conda env create -f environment.yml
conda activate fairtrec

High-throughput data processing tools, such as the subsetter, are implemented in Rust (installed in the Conda environment); to build, run:

cargo build --release
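Before building or downloading, it can help to confirm that the command-line tools named in this README are on your PATH. The following is a minimal sketch; the helper name `require_tool` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: report whether a command is available on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# conda and cargo are used for the environment and the build;
# aws is only needed for downloading the Open Corpus.
for tool in conda cargo aws; do
    require_tool "$tool" || true
done
```

A "missing" line for `cargo` after `conda activate fairtrec` would suggest the environment was not created or activated correctly.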

Downloading Data

You need two sets of data:

  1. The released data files from the Fair TREC web site, stored in data/ai2-trec-release

  2. The Open Corpus, downloaded to data/corpus with:

    aws s3 cp --no-sign-request --recursive s3://ai2-s2-research-public/open-corpus/2020-05-27/ data/corpus
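Since the subsetting commands read from both locations, a quick check that the two directories exist and are non-empty can catch an incomplete download early. This is only a sketch: the function name `check_data_dir` is hypothetical, and it does not verify that the contents are complete.

```shell
# Hypothetical helper: succeed only if the directory exists and has entries.
check_data_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "ok: $dir"
    else
        echo "missing or empty: $dir"
        return 1
    fi
}

# The two data locations named in the list above.
for d in data/ai2-trec-release data/corpus; do
    check_data_dir "$d" || true
done
```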
    

Subsetting the Corpus

To regenerate the Open Corpus subset containing all papers listed in the paper metadata file, run:

./target/release/subset-corpus -M data/ai2-trec-release/paper_metadata.csv \
    -o data/corpus-subset-for-meta.gz data/corpus

To generate a subset based on the candidate sets from query records, run:

./target/release/subset-corpus -Q data/TREC-Competition-training-sample.json \
    -o data/corpus-subset-for-queries.jsonl.gz data/corpus

The subset command also writes a metadata CSV file alongside the compressed JSON output.

Run the command with --help to print usage information.
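To sanity-check a generated subset, you can count its records and look at the first one. The sketch below assumes the output is gzip-compressed JSON Lines with one record per line, as the `.jsonl.gz` extension suggests; the function name `summarize_subset` is hypothetical.

```shell
# Hypothetical helper: print the record count and the first record of a
# gzip-compressed file, assuming one JSON record per line.
summarize_subset() {
    file="$1"
    # gzip -dc decompresses to stdout and leaves the file on disk unchanged.
    count="$(gzip -dc "$file" | wc -l | tr -d ' ')"
    echo "records: $count"
    gzip -dc "$file" | head -n 1
}

# Only runs if the query-based subset from the command above exists.
subset=data/corpus-subset-for-queries.jsonl.gz
if [ -f "$subset" ]; then
    summarize_subset "$subset"
fi
```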
