This repository has been archived by the owner on Aug 12, 2019. It is now read-only.

CrossRef/reference-matching-evaluation


Evaluation framework for reference matching algorithms

This code allows for:

  • generating artificial datasets for reference matching
  • applying reference matching approaches
  • evaluating reference matching approaches

Before running the scripts, add the code directory to the PYTHONPATH environment variable.

Polite/Plus support

Most scripts call the Crossref REST API under the hood. To use the polite pool or the Plus service when calling the API, copy the file .crapi_key_example to ~/.crapi_key and fill in the settings in the file.

Dataset generation

A dataset is generated in several steps:

Sample extraction

Drawing a sample of documents from the system.

dataset/draw_sample.py [-h] [-v] -s SIZE [-f FILTER] [-q QUERY] -o OUTPUT

Arguments:

  • -h print help
  • -v verbose output
  • -s sample size
  • -f filter to apply before drawing the sample
  • -q query to apply before drawing the sample
  • -o output file path
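For example, a hypothetical invocation drawing a 100-document sample (the filter value and output file name are illustrative; any Crossref REST API filter can be used):

```shell
python dataset/draw_sample.py -v -s 100 -f type:journal-article -o sample.json
```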

Sample extension

Extending the sample by adding similar documents.

dataset/extend_sample.py [-h] [-v] -s SAMPLE -e EXTEND -o OUTPUT

Arguments:

  • -h print help
  • -v verbose output
  • -s sample file path
  • -e number of similar documents to add for each sample document
  • -o output file path

Generating the dataset

Generating reference strings and the final dataset.

dataset/generate_dataset.py [-h] [-v] [-l STYLES] [-a ATTRIBUTES] [-n] -s SAMPLE -o OUTPUT

Arguments:

  • -h print help
  • -v verbose output
  • -l comma-separated list of citation styles
  • -a comma-separated list of document attributes to keep (useful for splitting the results by the values of those attributes)
  • -n set the target DOIs to null
  • -s sample file path
  • -o output file path
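For example, a hypothetical invocation generating reference strings in two citation styles (the style identifiers and file names are illustrative assumptions):

```shell
python dataset/generate_dataset.py -v -l apa,ieee -a type -s sample.json -o dataset.json
```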

Reference matching

Match the references from the dataset to target DOIs.

matching/match_references.py [-h] [-v] -d DATASET -o OUTPUT

Arguments:

  • -h print help
  • -v verbose output
  • -d dataset file path
  • -o output file path

The matching algorithm is set by specifying the matcher object in matching/match_config.py. The matcher object should implement two methods:

  • description(): returns a string
  • match(reference_string): returns a tuple (matched DOI, matching score)
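A minimal sketch of a matcher object satisfying this interface. The class name and the lookup-table logic are purely illustrative; a real matcher would run an actual matching algorithm, e.g. a Crossref API search:

```python
# Illustrative matcher implementing the two-method interface expected by
# matching/match_config.py. A hard-coded title-to-DOI lookup table stands
# in for a real matching algorithm; the DOI below is a placeholder.
class ExactTitleMatcher:
    def __init__(self, known_titles):
        # known_titles: dict mapping a title to its DOI (toy data)
        self.known_titles = known_titles

    def description(self):
        # returns a string describing the algorithm
        return "toy matcher: exact title lookup"

    def match(self, reference_string):
        # returns a tuple (matched DOI or None, matching score)
        for title, doi in self.known_titles.items():
            if title.lower() in reference_string.lower():
                return doi, 1.0
        return None, 0.0

matcher = ExactTitleMatcher(
    {"On the Electrodynamics of Moving Bodies": "10.0000/example"})
```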

Evaluation

Evaluate the reference matching. The results are printed to stdout.

evaluation/evaluate.py [-h] [-v] -d DATASET [-s SPLITATTR] [-o OUTPUTDOC] [-p OUTPUTSPLIT]

Arguments:

  • -h print help
  • -v verbose output
  • -d dataset file path (matched)
  • -s attribute along which the results should be split
  • -o output file path for document-level results
  • -p output file path for split attribute results
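Putting the main steps together, an end-to-end run might look like this (the file names and sample size are illustrative):

```shell
# 1. Draw a sample and generate reference strings for it
python dataset/draw_sample.py -s 100 -o sample.json
python dataset/generate_dataset.py -s sample.json -o dataset.json
# 2. Match references using the matcher configured in matching/match_config.py
python matching/match_references.py -d dataset.json -o matched.json
# 3. Evaluate the matching; results are printed to stdout
python evaluation/evaluate.py -d matched.json
```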

Additional scripts

Dataset merging

Merge two or more sample or reference datasets into a single file.

dataset/merge_datasets.py [-h] [-v] -d DATASETS -o OUTPUT

Arguments:

  • -h print help
  • -v verbose output
  • -d comma-separated list of dataset file paths
  • -o output file path

Links export

Export information about a sample of existing and missing links in the current system.

dataset/export_existing_and_new_links.py [-h] [-v] -s SAMPLE [-n REFERENCES] [-p] -o OUTPUT

Arguments:

  • -h print help
  • -v verbose output
  • -s sample file path
  • -n max number of output references
  • -p export parsed references instead of strings
  • -o output file path