norikinishida/biomedical-discourse-treebanks

Biomedical Discourse Treebanks

This repository contains biomedical discourse treebanks:

  • COVID19-DTB (Nishida and Matsumoto, 2021; to appear in TACL)

COVID19-DTB: COVID-19 Discourse Dependency Treebank

COVID19-DTB is a collection of manually annotated discourse dependency structures for scholarly paper abstracts on COVID-19 and related coronaviruses like SARS and MERS.

The following figure shows an example of a discourse dependency structure for an abstract (Israeli et al., 2020) in COVID19-DTB.

Description

First, we randomly sampled 300 abstracts from the September 2020 snapshot of the COVID-19 Open Research Dataset (CORD-19). The authors then manually segmented the 300 abstracts into Elementary Discourse Units (EDUs). Next, we employed two professional annotators, who independently assigned gold discourse dependency structures to the 300 abstracts. Finally, we divided the results into development and test sets of 150 examples each.

Further details on the annotation scheme, annotation procedure, and corpus statistics can be found in our paper.

Data format

We follow the same JSON schema as the SciDTB dataset (Yang and Li, 2018). Each discourse dependency structure is stored as a JSON file like the following:

{
    "root": [
        {
            "id": 0,
            "parent": -1,
            "text": "ROOT",
            "relation": "null"
        },
        {
            "id": 1,
            "parent": 3,
            "text": "SARS - CoV-2 genetic identification is based on viral RNA extraction prior to RT - qPCR assay ,",
            "relation": "BACKGROUND"
        },
        {
            "id": 2,
            "parent": 1,
            "text": "however recent studies support the elimination of the extraction step . <S>",
            "relation": "COMPARISON"
        },
        {
            "id": 3,
            "parent": 0,
            "text": "Herein , we assessed the RNA extraction necessity ,",
            "relation": "ROOT"
        },
        {
            "id": 4,
            "parent": 3,
            "text": "by comparing RT - qPCR efficacy in several direct approaches vs. the gold standard RNA extraction ,",
            "relation": "MANNER-MEANS"
        },
        {
            "id": 5,
            "parent": 3,
            "text": "in detection of SARS - CoV-2 from laboratory samples as well as clinical Oro - nasopharyngeal SARS - CoV-2 swabs . <S>",
            "relation": "SAME-UNIT"
        },
        {
            "id": 6,
            "parent": 3,
            "text": "Our findings show advantage for the extraction procedure ,",
            "relation": "FINDINGS"
        },
        {
            "id": 7,
            "parent": 6,
            "text": "however a direct no - buffer approach might be an alternative ,",
            "relation": "COMPARISON"
        },
        {
            "id": 8,
            "parent": 7,
            "text": "since it identified up to 70 % of positive clinical specimens . <S> <P>",
            "relation": "CAUSE-RESULT"
        }
    ]
}

"<S>" and "<P>" denote the sentence and paragraph boundaries, respectively.
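As a minimal sketch of how this format might be consumed, the Python snippet below parses the example above, builds a head-to-dependents map from the "parent" fields, and groups EDU ids into sentences using the "<S>" marker. The helper names (`children_of`, `split_sentences`) are hypothetical and are not part of this repository; the snippet also assumes that boundary markers appear as whitespace-separated tokens at the end of the EDU text, as in the example.

```python
import json
from collections import defaultdict

# The example from this README, embedded as a string so the sketch is
# self-contained. In practice one would json.load() a treebank file instead.
EXAMPLE = """
{"root": [
 {"id": 0, "parent": -1, "text": "ROOT", "relation": "null"},
 {"id": 1, "parent": 3, "text": "SARS - CoV-2 genetic identification is based on viral RNA extraction prior to RT - qPCR assay ,", "relation": "BACKGROUND"},
 {"id": 2, "parent": 1, "text": "however recent studies support the elimination of the extraction step . <S>", "relation": "COMPARISON"},
 {"id": 3, "parent": 0, "text": "Herein , we assessed the RNA extraction necessity ,", "relation": "ROOT"},
 {"id": 4, "parent": 3, "text": "by comparing RT - qPCR efficacy in several direct approaches vs. the gold standard RNA extraction ,", "relation": "MANNER-MEANS"},
 {"id": 5, "parent": 3, "text": "in detection of SARS - CoV-2 from laboratory samples as well as clinical Oro - nasopharyngeal SARS - CoV-2 swabs . <S>", "relation": "SAME-UNIT"},
 {"id": 6, "parent": 3, "text": "Our findings show advantage for the extraction procedure ,", "relation": "FINDINGS"},
 {"id": 7, "parent": 6, "text": "however a direct no - buffer approach might be an alternative ,", "relation": "COMPARISON"},
 {"id": 8, "parent": 7, "text": "since it identified up to 70 % of positive clinical specimens . <S> <P>", "relation": "CAUSE-RESULT"}
]}
"""


def children_of(edus):
    """Map each head EDU id to the ids of its dependents, in document order."""
    children = defaultdict(list)
    for edu in edus:
        if edu["parent"] >= 0:  # the artificial ROOT node has parent -1
            children[edu["parent"]].append(edu["id"])
    return dict(children)


def split_sentences(edus):
    """Group EDU ids into sentences, closing a group at each "<S>" marker."""
    sentences, current = [], []
    for edu in edus:
        if edu["id"] == 0:  # skip the artificial ROOT node
            continue
        current.append(edu["id"])
        if "<S>" in edu["text"].split():
            sentences.append(current)
            current = []
    if current:  # trailing EDUs without a closing marker, if any
        sentences.append(current)
    return sentences


edus = json.loads(EXAMPLE)["root"]
heads = children_of(edus)
sentences = split_sentences(edus)

for edu in edus[1:]:
    print(f'{edu["id"]} <- {edu["parent"]} ({edu["relation"]})')
```

Note that the ROOT node stores its relation as the string "null" rather than a JSON null, so code that checks for a missing relation should compare against that string.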

Citation

The blank entries (i.e., volume, pages, doi) in the BibTeX entry below will be filled in after publication in TACL.

@article{nishida2021outofdomain,
    title={Out-of-Domain Discourse Dependency Parsing via Bootstrapping: {A}n Empirical Analysis on its Effectiveness and Limitation},
    author={Nishida, Noriki and Matsumoto, Yuji},
    journal={Transactions of the Association for Computational Linguistics},
    volume={},
    pages={},
    year={2021},
    doi={},
}
