
NLP Contribution Graph

This repo contains data and code for solving SemEval-2021 Task 11: NLP Contribution Graph.
For a detailed description of our method, please see the paper "UIUC_BioNLP at SemEval-2021 Task 11: A Cascade of Neural Models for Structuring Scholarly NLP Contributions".

Dependencies

  • This repo requires simpletransformers/ - the customized Simple Transformers package
    • It includes a customized model for subtask 1 that incorporates additional features
    • It extends Simple Transformers version 0.51.10 and remains compatible with common usage
    • To install it, first install the base package:
      • pip install simpletransformers==0.51.10
      • then find the installation directory and replace its simpletransformers folder with this folder (a helper for locating the directory is sketched below)
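
If you are unsure where pip placed the package, the following prints the path of the installed simpletransformers folder, i.e. the one to replace:

```python
import os
import simpletransformers

# Path of the installed simpletransformers package folder;
# replace this folder with the customized one from this repo.
print(os.path.dirname(simpletransformers.__file__))
```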

Data

  • training_data/ - the training data merged with the trial data, with full annotation.
  • interim/ - intermediate data files converted from the training data
    • all_sent.csv - contains all the sentences, each with its section header, positional features, paper topic and index, BIO tags, etc.
    • pos_sent.csv - a subset of all_sent.csv consisting of all the positive sentences.
    • triples.csv - contains each positive sentence with the predicates and terms in it, and the corresponding triples of different types.
  • test_data/ - the test data, with sentence and phrase annotation released.
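
A minimal sketch for inspecting the interim files with pandas; the paths follow the layout above, but the exact column names are assumptions based on the descriptions:

```python
import pandas as pd

all_sent = pd.read_csv("interim/all_sent.csv")
pos_sent = pd.read_csv("interim/pos_sent.csv")
triples = pd.read_csv("interim/triples.csv")

# Column names (section header, positional features, BIO tags, ...) vary;
# inspect them rather than hard-coding.
print(all_sent.columns.tolist())
print(f"{len(pos_sent)} positive sentences out of {len(all_sent)}")
```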

Scripts

  • pre.py - preprocesses the training data, reports potential errors, and produces all_sent.csv and pos_sent.csv

  • ext.py - preprocesses the training data and produces triples.csv

  • train_sent/ - Note that all scripts in this folder require the customized Simple Transformers package.

    • A binary classifier is trained for subtask 1: contribution sentence classification (see the sketch after this item)
    • A multi-class classifier is trained to classify sentences into information units
    • A filename ending in '_ens' indicates that submodels are trained for ensembling.
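
For orientation, here is a minimal sketch of the subtask 1 classifier using the stock Simple Transformers API. The actual scripts depend on the customized package to incorporate the additional features; the model type, toy data, and hyperparameters below are assumptions, not the repo's settings.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy data with the layout ClassificationModel expects:
# a "text" column and a binary "labels" column
# (1 = contribution sentence, 0 = not).
train_df = pd.DataFrame({
    "text": ["We propose a cascade of neural models.",
             "Section 2 reviews related work."],
    "labels": [1, 0],
})

model = ClassificationModel(
    "bert", "bert-base-uncased", num_labels=2,
    args={"num_train_epochs": 3, "overwrite_output_dir": True},
)
model.train_model(train_df)
predictions, _ = model.predict(["Our model achieves state-of-the-art results."])
```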
  • train_ner/ - The models are trained for subtask 2: key phrase extraction.

    • In the 'specific_bio' scheme, we use specific BIO tags to indicate phrase types and train an NER model directly.
    • In the 'simple_bio' scheme, we first identify the phrases and then classify them into predicates and terms. A script for ensembling the models is also provided.
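
A minimal sketch of the 'specific_bio' idea with the stock NERModel; the label set below is an assumption for illustration, not the repo's actual tag set.

```python
from simpletransformers.ner import NERModel

# Assumed 'specific_bio' label set: BIO tags that also encode the phrase type.
labels = ["O", "B-predicate", "I-predicate", "B-term", "I-term"]

model = NERModel(
    "bert", "bert-base-uncased", labels=labels,
    args={"num_train_epochs": 3, "overwrite_output_dir": True},
)
# model.train_model("train.txt")  # CoNLL-style file, one "token tag" pair per line
predictions, _ = model.predict(["We propose a novel attention mechanism"])
```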
  • train_rel/ - For subtask 3 (triple extraction), four models are trained to extract triples of types A, B, C and D respectively.

    • For type A triples, two schemes are implemented: pairwise classification and direct triple classification. Only the latter scheme was used in the evaluation phases.
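
One way to read direct triple classification is as binary classification over candidate triples rendered as a single sequence. Below is a hedged sketch using the stock classification model; the separator and toy examples are assumptions, not the repo's actual encoding:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Assumed encoding: the three triple elements joined into one sequence,
# labeled 1 for a valid triple and 0 for an invalid candidate.
candidates = pd.DataFrame({
    "text": ["our approach || uses || attention mechanism",
             "our approach || uses || related work"],
    "labels": [1, 0],
})

model = ClassificationModel(
    "bert", "bert-base-uncased", num_labels=2,
    args={"num_train_epochs": 3, "overwrite_output_dir": True},
)
model.train_model(candidates)
```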
  • predict1/ - scripts for Evaluation Phase 1 (end-to-end evaluation). Run the scripts in this order (a small driver is sketched below):

    • pre.py - test data preprocessing
    • sent_binary.py - contribution sentence classification
    • sent_multi.py - information unit classification
    • ner.py - phrase extraction; the 'specific_bio' scheme was used in this phase
    • predict_triples.py - extraction of type A, B, C and D triples, using different models
    • submit.ipynb - output formatting for submission
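
A small driver (not part of the repo) that runs the Phase 1 scripts in the documented order, assuming it is invoked from the repository root; submit.ipynb is then run interactively:

```python
import subprocess

# Run each pipeline stage in order, stopping on the first failure.
for script in ["pre.py", "sent_binary.py", "sent_multi.py",
               "ner.py", "predict_triples.py"]:
    subprocess.run(["python", script], check=True, cwd="predict1")
```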
  • predict2/ - scripts for Evaluation Phase 2 Part 1: given the contribution sentence labels, do the rest.

    • The naming of the scripts largely follows that in predict1/.
    • A filename ending in '-ens' indicates that an ensemble of submodels is used for prediction.
    • In this phase and afterwards, we used the 'simple_bio' scheme for phrase extraction.
  • predict3/ - scripts for Evaluation Phase 2 Part 2: given the labels of contribution sentences and phrases, do the rest.

    • We copied the result of information unit classification from predict2/, so after running pre.py we start directly from phrase classification.

Useful Links