Light Coreference Resolution for Russian with Hierarchical Discourse Features

This is our solution for the RuCoCo-23 shared task: Coreference Resolution in Russian (single-antecedent resolution only).

1. Set up the syntax & NER parser

  • (Option 1) With Docker
    • Run the container locally or remotely using the following command:
         docker run --rm -d -p 3334:3333 --name spacy_ru tchewik/isanlp_spacy:ru
      
    • Connect to it from Python:
      from isanlp.processor_remote import ProcessorRemote
       
      spacy_address = ['0.0.0.0', 3334]
      spacy_processor = (ProcessorRemote(spacy_address[0], spacy_address[1], '0'),
                         ['tokens', 'sentences'],
                         {'lemma': 'lemma',
                          'postag': 'postag',
                          'morph': 'morph',
                          'syntax_dep_tree': 'syntax_dep_tree',
                          'entities': 'entities'})
  • (Option 2) Locally
    • Download the model
      python -m spacy download ru_core_news_lg
      
    • Initialize in Python using ProcessorSpaCy
      from isanlp.processor_spacy import ProcessorSpaCy
      
      spacy_processor = (ProcessorSpaCy(model_name='ru_core_news_lg'),
                        ['tokens', 'sentences'],
                        {'lemma': 'lemma',
                         'postag': 'postag',
                         'morph': 'morph',
                         'syntax_dep_tree': 'syntax_dep_tree',
                         'entities': 'entities'})

2. Set up the RST parser (only for model_rh)

  • (Only option) With Docker
    • Run the container locally or remotely using the following command:
      docker run --rm -d -p 3335:3333 --name rst_ru tchewik/isanlp_rst:2.1-rstreebank
      
    • Connect to it from Python:
      from isanlp.processor_remote import ProcessorRemote
      
      rst_address = ['0.0.0.0', 3335]
      rst_processor = (ProcessorRemote(rst_address[0], rst_address[1], 'default'),
                       ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'],
                       {'rst': 'rst'})

3. Set up the coreference resolver

There are two models from the test leaderboard of RuCoCo-23: base and Rh-enhanced. The latter requires RST parsing, which makes it considerably slower. Each can be run either with Docker or locally.

name     F1 (dev)  F1 (test)  time (example, CPU only)  for local run (place into models/)  docker image
base     74.3      72.8       ~883 ms                   model_base.tar.gz                   tchewik/corefhd:base
base+rh  74.6      73.3       ~19 s                     model_rh.tar.gz                     tchewik/corefhd:rh
  • (Option 1) With Docker
    • Run the container locally or remotely using the following command with the selected tag (base or rh):
         docker run --rm -d -p 3336:3333 --name corefhd tchewik/isanlp_corefhd:<tag>
      
    • Connect to it from Python:
      from isanlp.processor_remote import ProcessorRemote
      
      coref_address = ['0.0.0.0', 3336]
      
      # Base model
      corefhd = (ProcessorRemote(coref_address[0], coref_address[1], 'default'),
                 ['text', 'tokens', 'sentences',
                  'lemma', 'postag', 'syntax_dep_tree', 'entities'],
                 {'entity_clusters': 'entity_clusters'})
      
      # Rh model
      corefhd = (ProcessorRemote(coref_address[0], coref_address[1], 'default'),
                 ['text', 'tokens', 'sentences',
                  'lemma', 'postag', 'syntax_dep_tree', 'entities', 'rst'],
                 {'entity_clusters': 'entity_clusters'})
  • (Option 2) Locally
    • Download the model as models/model_base.tar.gz or models/model_rh.tar.gz (links in the table above).
    • Find the Python path for allennlp and update it for LUKE (see load_custom_allennlp_scripts.bash).
    • Initialize in Python using ProcessorCorefHD:
      from processor_corefhd import ProcessorCorefHD
      
      # Base model
      corefhd_processor = (ProcessorCorefHD(cuda_device=-1, use_discourse=False),
                 ['text', 'tokens', 'sentences',
                  'lemma', 'postag', 'syntax_dep_tree', 'entities'],
                 {'entity_clusters': 'entity_clusters'})
      
      # Rh model
      corefhd_processor = (ProcessorCorefHD(cuda_device=-1, use_discourse=True),
                 ['text', 'tokens', 'sentences',
                  'lemma', 'postag', 'syntax_dep_tree', 'entities', 'rst'],
                 {'entity_clusters': 'entity_clusters'})

4. Process the texts

  • Construct the pipeline from initialized processors:

    • For base model

        from isanlp import PipelineCommon
        from isanlp.processor_razdel import ProcessorRazdel
      
        ppl = PipelineCommon([
           (ProcessorRazdel(), ['text'],
            {'tokens': 'tokens',
             'sentences': 'sentences'}),
           spacy_processor,
           corefhd_processor
        ])
    • For Rh model

        from isanlp import PipelineCommon
        from isanlp.processor_razdel import ProcessorRazdel
      
        ppl = PipelineCommon([
           (ProcessorRazdel(), ['text'],
            {'tokens': 'tokens',
             'sentences': 'sentences'}),
           spacy_processor,
           rst_processor,
           corefhd_processor
        ])
  • Run the constructed pipeline:

    text = open('text_example.txt', 'r').read().strip()
    result = ppl(text)

    The result is given as token spans (inclusive start and end token indices):

       >>> result['entity_clusters']
       [[[0, 1], [7, 7], [19, 19], [103, 104], [126, 126]],
        [[23, 27], [30, 30]],
        [[68, 69], [72, 72]],
        [[78, 83], [132, 132]],
        [[44, 53], [138, 138], [152, 152]],
        [[133, 134], [140, 140], [149, 149]],
        [[89, 90], [142, 142]]]
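    The spans above are pairs of token indices with an inclusive end (so [7, 7] is a single-token mention). A minimal self-contained sketch of mapping such clusters back to character offsets; the Token class below is only a stand-in for isanlp token objects, which expose .begin and .end character offsets:

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Stand-in for an isanlp token with character offsets."""
    begin: int
    end: int


def clusters_to_char_spans(tokens, entity_clusters):
    """Map inclusive token-index spans [start, end] to (begin, end) character offsets."""
    return [[(tokens[start].begin, tokens[end].end) for start, end in cluster]
            for cluster in entity_clusters]


# Toy example: "Anna said she left." tokenized by whitespace.
text = "Anna said she left."
tokens = [Token(0, 4), Token(5, 9), Token(10, 13), Token(14, 19)]
clusters = [[[0, 0], [2, 2]]]  # "Anna" <- "she"
print(clusters_to_char_spans(tokens, clusters))  # [[(0, 4), (10, 13)]]
```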

    An example of finding the corresponding text spans:

    def print_coreference_clusters(text, tokens, entity_clusters):
        def mention_to_str(mention):
            return text[tokens[mention[0]].begin: tokens[mention[1]].end]

        for entity in entity_clusters:
            print(f'{mention_to_str(entity[0])} ::: {[mention_to_str(mention) for mention in entity[1:]]}')
       
    >>> print_coreference_clusters(result['text'], result['tokens'], result['entity_clusters'])
    Иоганн Шильтбергер ::: ['он', 'отрок', 'сам Иоганн', 'он']
    рыцаря по имени Леонгарт Рихартингер ::: ['его']
    венгерские крестоносцы ::: ['которым']
    24-летним сыном герцога Бургундии Жаном Бесстрашным ::: ['Жана']
    венгерский король и будущий император Священной Римской империи Сигизмунд I ::: ['Сигизмунда', 'Сигизмунд']
    бургундские рыцари ::: ['Они', 'им']
    турецкой армией ::: ['турок']
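Throughout the setup above, every processor is described by a (processor, input_keys, output_mapping) triple: PipelineCommon calls the processor on the listed annotation keys and stores its outputs under the mapped names. A toy mimic of that loop (illustrative only, not the actual isanlp implementation):

```python
def run_pipeline(processors, annotations):
    """Apply (processor, input_keys, output_map) triples to an annotations dict."""
    for proc, input_keys, output_map in processors:
        results = proc(*[annotations[key] for key in input_keys])
        for src, dst in output_map.items():
            annotations[dst] = results[src]
    return annotations


def toy_tokenizer(text):
    """Stub standing in for ProcessorRazdel: whitespace tokenization."""
    words = text.split()
    return {'tokens': words, 'sentences': [[0, len(words)]]}


annotations = run_pipeline(
    [(toy_tokenizer, ['text'], {'tokens': 'tokens', 'sentences': 'sentences'})],
    {'text': 'Мама мыла раму'})
print(annotations['tokens'])  # ['Мама', 'мыла', 'раму']
```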

Cite

Further information and examples can be found in our paper:

@INPROCEEDINGS{chistova2023light,
      author = {Chistova, E. and Smirnov, I.},
      title = {Light Coreference Resolution for Russian with Hierarchical Discourse Features},
      booktitle = {Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue" (2023)},
      year = {2023},
      number = {22},
      pages = {34--41}
}
