amirveyseh/AAAI-21-SDU-shared-task-1-AI

SDU@AAAI-21 - Shared Task 1: Acronym Identification

This repository contains the acronym identification training and development sets, along with the evaluation script, for the acronym identification task at SDU@AAAI-21.

Dataset

The dataset folder contains three files:

  • train.json: The training samples for the acronym identification task. Each sample has three attributes:
    • tokens: The list of words (tokens) of the sample
    • labels: The short-form and long-form labels of the words in BIO format. The labels B-short and B-long identify the beginning of a short-form and a long-form phrase, respectively. The labels I-short and I-long indicate the words inside a short-form or long-form phrase. Finally, the label O shows that the word is not part of any short-form or long-form phrase.
    • id: The unique ID of the sample
  • dev.json: The development set for the acronym identification task. The samples in dev.json have the same attributes as the samples in train.json.
  • predictions.json: A sample prediction file created from dev.json to test the scoring script. Participants should submit the final test predictions of their model in the same format as the predictions.json file. Each prediction should have two attributes:
    • id: The ID of the sample (i.e., the same IDs used in the train/dev/test samples provided in train.json, dev.json and test.json)
    • predictions: The labels of the words of the sample in BIO format, using the same label set as the labels attribute described above (B-short, I-short, B-long, I-long and O).
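
The BIO labels described above can be turned back into phrase spans with a few lines of Python. The following is a minimal sketch, not part of the released code: the helper names load_samples and bio_to_spans and the sample contents are made up for illustration, and load_samples assumes that each dataset file is a single JSON array of sample objects.

```python
import json


def load_samples(path):
    """Load samples from a dataset file (assumed to be one JSON array)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def bio_to_spans(labels):
    """Convert BIO labels into (start, end, kind) spans; end is exclusive."""
    spans = []
    start, kind = None, None
    for i, label in enumerate(labels):
        # Close the open span unless this label continues it.
        if start is not None and label != f"I-{kind}":
            spans.append((start, i, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
        # A stray I- tag with no open span is ignored in this sketch.
    if start is not None:
        spans.append((start, len(labels), kind))
    return spans


# Hypothetical sample in the same shape as an entry of train.json.
sample = {
    "id": "example-1",
    "tokens": ["We", "use", "a", "convolutional", "neural", "network", "(", "CNN", ")", "."],
    "labels": ["O", "O", "O", "B-long", "I-long", "I-long", "O", "B-short", "O", "O"],
}

for start, end, kind in bio_to_spans(sample["labels"]):
    print(kind, " ".join(sample["tokens"][start:end]))
```

For the hypothetical sample, this prints the long form "convolutional neural network" followed by the short form "CNN".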

Code

To familiarize participants with this task, we provide a rule-based model in the code directory. This baseline implements the method proposed by Schwartz and Hearst [1]. To identify acronyms, the model labels a word as an acronym (i.e., short-form) if more than 60% of its characters are uppercase. To identify the long-form, it compares the characters of the acronym with the characters of the words before or after the acronym, up to a certain window size. If the characters of these words can form the acronym, they are labeled as the long-form. To run this model, use the following command:

python code/character_match.py -input <path/to/input.json> -output <path/to/output.json>

Please replace <path/to/input.json> and <path/to/output.json> with the real paths to the input file (e.g., dataset/dev.json) and the output file. The output file contains the predictions and can be evaluated by the scorer using the command described in the next section. The official scores for this baseline are: Precision: 93.22%, Recall: 78.90%, F1: 85.46%
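
The two rules behind the baseline can be illustrated with a short sketch. This is not the code in code/character_match.py: the function names, the default window of 10 tokens, and the decision to match only the initial letters of consecutive words are simplifying assumptions made for this example.

```python
def is_short_form(token, threshold=0.6):
    """True if more than `threshold` of the token's characters are uppercase."""
    if not token:
        return False
    upper = sum(ch.isupper() for ch in token)
    return upper / len(token) > threshold


def find_long_form(tokens, idx, window=10):
    """Find consecutive words near tokens[idx] whose initials spell the acronym.

    Returns (start, end) with end exclusive, or None. Matching initials only
    is an assumption of this sketch.
    """
    letters = [ch.lower() for ch in tokens[idx] if ch.isalpha()]
    n = len(letters)
    if n == 0:
        return None
    lo = max(0, idx - window)
    hi = min(len(tokens), idx + 1 + window)
    for start in range(lo, hi - n + 1):
        end = start + n
        if start <= idx < end:
            continue  # a candidate must not contain the acronym itself
        initials = [t[:1].lower() for t in tokens[start:end]]
        if initials == letters:
            return start, end
    return None


def label_tokens(tokens, window=10):
    """Assign BIO labels using the two heuristics above."""
    labels = ["O"] * len(tokens)
    for i, token in enumerate(tokens):
        if not is_short_form(token):
            continue
        labels[i] = "B-short"
        span = find_long_form(tokens, i, window)
        if span is not None:
            start, end = span
            labels[start] = "B-long"
            for j in range(start + 1, end):
                labels[j] = "I-long"
    return labels


tokens = ["We", "use", "a", "convolutional", "neural", "network", "(", "CNN", ")", "."]
print(label_tokens(tokens))
```

Note that a purely case-based rule also fires on single uppercase letters such as "A" or "I", which is one reason a rule of this kind trades recall and precision differently from a learned model.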

Evaluation

To evaluate the predictions (in the format of the dataset/predictions.json file), run the following command:

python scorer.py -g path/to/gold.json -p path/to/predictions.json

Replace path/to/gold.json and path/to/predictions.json with the real paths to the gold file (e.g., dataset/dev.json for evaluation on the development set) and the predictions file (i.e., the predictions generated by your system in the same format as the dataset/predictions.json file). The official evaluation metrics are the macro-averaged precision, recall and F1 for short-form and long-form predictions. For verbose evaluation (including the micro-averaged precision, recall and F1, as well as the short-form and long-form scores separately), use the following command:

python scorer.py -g path/to/gold.json -p path/to/predictions.json -v
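
To make the macro-averaged metrics concrete, here is a minimal sketch of one way to compute them. It is not scorer.py, which remains the reference. The sketch assumes that a predicted phrase counts as correct only if its boundaries and type match a gold phrase exactly, that precision and recall are averaged over the two classes (short and long), and that F1 is the harmonic mean of those two averages. Averaging the per-class F1 values is another common convention and can give a different number.

```python
from collections import Counter


def spans(labels):
    """Return the set of (kind, start, end) spans encoded by BIO labels."""
    found, start, kind = set(), None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, label in enumerate(list(labels) + ["O"]):
        if start is not None and label != f"I-{kind}":
            found.add((kind, start, i))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return found


def macro_scores(gold, pred):
    """gold and pred map a sample id to its list of BIO labels."""
    tp, n_pred, n_gold = Counter(), Counter(), Counter()
    for sample_id, gold_labels in gold.items():
        gold_spans = spans(gold_labels)
        # A sample with no prediction contributes no predicted spans.
        pred_spans = spans(pred.get(sample_id, []))
        for kind, _, _ in gold_spans:
            n_gold[kind] += 1
        for kind, _, _ in pred_spans:
            n_pred[kind] += 1
        for kind, _, _ in gold_spans & pred_spans:
            tp[kind] += 1
    precisions, recalls = [], []
    for kind in ("short", "long"):
        precisions.append(tp[kind] / n_pred[kind] if n_pred[kind] else 0.0)
        recalls.append(tp[kind] / n_gold[kind] if n_gold[kind] else 0.0)
    precision = sum(precisions) / 2
    recall = sum(recalls) / 2
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted labels for a single sample.
gold = {"s1": ["O", "B-long", "I-long", "O", "B-short"]}
pred = {"s1": ["O", "B-long", "O", "O", "B-short"]}
print(macro_scores(gold, pred))
```

In this example the short form is predicted correctly while the long form has the wrong right boundary, so each class contributes 1.0 and 0.0 respectively, and all three macro scores are 0.5.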

Participation

In order to participate, please first fill out this form to register for the shared tasks: https://forms.gle/NvnT549mSbyeJQAPA. The team name that is provided in this form will be used in the subsequent submissions and communications. The shared task is organized in two separate phases:

  • Development Phase: In this phase, the participants will use the training/development sets provided in this repository to design and develop their models.
  • Evaluation Phase: Two weeks before the system runs are due, i.e., 20th November 2020, the test set is released here. The test set has the same distribution and format as the development set. Run your model on the provided test set and save the prediction results in a JSON file with the same format as the "predictions.json" file. Name the prediction file "output.json" and send it to the email address sdu-aaai21@googlegroups.com with the title "Results of AI-[TEAM-name]-[RUN-ID]", where "[TEAM-name]" should be replaced with the name of your team provided in the registration form and "[RUN-ID]" with a number between 1 and 10 to identify the model run. Each participating team is allowed to submit up to 10 different model runs. Note that your official score is reported for the model run with ID 1. In addition to the "output.json" file, please include the following information in your email:
    • Model Description: A brief summary of the model architecture. If your model uses word embeddings, please specify which type.
    • Extra Data: Whether or not the model employs other resources/data, e.g., acronym glossaries, in the development or evaluation phases.
    • Training/Evaluation Time: How long the model takes to be trained/evaluated on the provided dataset.
    • Run Description: A brief description of how this model run differs from your other runs (if applicable).
    • Plan for System Report: If you have any plan to submit your system report or release your model publicly, please specify that. Participants are strongly encouraged to submit a system report, regardless of the results.
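
Before sending a run, it can help to check that the prediction file is well formed. The sketch below builds prediction records with the id and predictions attributes described in the Dataset section and writes them to "output.json". The helper names, the validation checks, and the predict callable are hypothetical, and the sketch assumes that the file is a single JSON array of records; compare the result against dataset/predictions.json before submitting.

```python
import json

VALID_LABELS = {"O", "B-short", "I-short", "B-long", "I-long"}


def build_predictions(samples, predict):
    """Return one record per sample, using the callable predict(tokens) -> labels."""
    records = []
    for sample in samples:
        labels = list(predict(sample["tokens"]))
        if len(labels) != len(sample["tokens"]):
            raise ValueError(f"sample {sample['id']}: expected one label per token")
        unknown = set(labels) - VALID_LABELS
        if unknown:
            raise ValueError(f"sample {sample['id']}: unknown labels {sorted(unknown)}")
        records.append({"id": sample["id"], "predictions": labels})
    return records


def write_predictions(records, path="output.json"):
    """Write the records as one JSON array (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


# Hypothetical test sample and a trivial predictor that labels everything "O".
samples = [{"id": "TS-0", "tokens": ["A", "short", "example", "."]}]
records = build_predictions(samples, lambda tokens: ["O"] * len(tokens))
print(records)
```

Calling write_predictions(records) then produces the "output.json" file to attach to the email.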

For more information, see SDU@AAAI-21.

Update: The CodaLab competition for the shared task is open. Participants can also submit their results to the Acronym Identification competition. For more information, please check the CodaLab competition for Acronym Identification.

Citation

If you use the dataset, baseline or evaluation script released in this repo, please cite our paper:

@inproceedings{veyseh-et-al-2020-what,
   title={{What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation}},
   author={Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen},
   year={2020},
   booktitle={Proceedings of COLING},
   link={https://arxiv.org/pdf/2010.14678v1.pdf}
}

Licenses

The dataset provided for this shared task is licensed under the CC BY-NC-SA 4.0 International license, and the evaluation script and the baseline are licensed under the MIT license.

References

[1] Schwartz AS, Hearst MA. A simple algorithm for identifying abbreviation definitions in biomedical text. Pac Symp Biocomput. 2003:451-62. PMID: 12603049.