Terminator

A tool for information extraction from Russian texts.

This tool includes the following modules:

  • Terms extraction
  • Relation extraction
  • Entity linking
  • Aspect extraction

Installation and preparation

To install:

git clone https://github.com/iis-research-team/Terminator.git

Before using the tool, download the following files:

  1. For terms extraction, download the weights file from here and put it in terms_extractor/dl_extractor/weights.

  2. For relation extraction:

2.1. Download the config file from here;

2.2. Download the model file from here;

2.3. Download the model arguments file from here;

and put all three in relation_extractor/dl_relation_extractor/weights.

  3. For entity linking:

3.1. Download the preprocessed Wikidata dump from here, unzip it, and put it in entity_linker/wikidata_dump;

3.2. Download the fastText model from here and put it in entity_linker/fasttext_model.

  4. For aspect extraction, download the weights file from here and put it in aspect_extractor/weights.
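Since the modules depend on files in several directories, a quick check before the first run can save a confusing import-time failure. The sketch below is not part of the repository: `missing_resources` and `REQUIRED_DIRS` are hypothetical names, and the check only tests that each directory listed above exists and is non-empty. It does not verify file names or contents.

```python
from pathlib import Path

# Directories named in the download steps, relative to the repository root.
REQUIRED_DIRS = [
    "terms_extractor/dl_extractor/weights",
    "relation_extractor/dl_relation_extractor/weights",
    "entity_linker/wikidata_dump",
    "entity_linker/fasttext_model",
    "aspect_extractor/weights",
]


def missing_resources(repo_root):
    """Return the required directories that are absent or empty (hypothetical helper)."""
    root = Path(repo_root)
    missing = []
    for rel in REQUIRED_DIRS:
        path = root / rel
        # A directory counts as ready only if it exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing
```

Calling `missing_resources('.')` from the repository root returns an empty list once every directory has been populated.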

How to use

Terms extraction

This module extracts terms from raw text.

from terms_extractor.combined_extractor.combined_extractor import CombinedExtractor

combined_extractor = CombinedExtractor()
text = 'Научные вычисления включают прикладную математику (особенно численный анализ), вычислительную технику ' \
       '(особенно высокопроизводительные вычисления) и математическое моделирование объектов изучаемых научной ' \
       'дисциплиной.'
result = combined_extractor.extract(text)
for token, tag in result:
    print(f'{token} -> {tag}')
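The extractor yields one tag per token, so multi-word terms have to be reassembled by the caller. The sketch below shows one way to do that, assuming BIO-style tags where a tag starting with `B` opens a term, any other non-`O` tag continues it, and `O` marks tokens outside a term. The tag names are an assumption, not something this README confirms, and `group_terms` is a hypothetical helper, so inspect the real output before relying on it.

```python
def group_terms(tagged_tokens, outside_tag="O"):
    """Merge consecutive tagged tokens into term strings.

    Assumes BIO-style tags; the actual tag set of the extractor may differ.
    """
    terms, current = [], []
    for token, tag in tagged_tokens:
        if tag == outside_tag:
            # An outside token closes any term in progress.
            if current:
                terms.append(" ".join(current))
                current = []
        elif tag.startswith("B") and current:
            # A new term starts immediately after the previous one.
            terms.append(" ".join(current))
            current = [token]
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current))
    return terms


# Hand-written sample with assumed tags, not real extractor output.
sample = [
    ("Научные", "B-TERM"), ("вычисления", "I-TERM"), ("включают", "O"),
    ("прикладную", "B-TERM"), ("математику", "I-TERM"),
]
print(group_terms(sample))  # ['Научные вычисления', 'прикладную математику']
```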

Relation extraction

This module extracts the relation between two terms. The input text must have the two terms marked with the special tokens <e1>...</e1> and <e2>...</e2>.

Example of relation extraction:

from relation_extractor.combined_relation_extractor.combined_relation_extractor import CombinedRelationExtractor

combined_extractor = CombinedRelationExtractor()
# The two terms are marked with <e1>...</e1> and <e2>...</e2>.
sample = '<e1>Модель</e1> используется в методе генерации и определения форм слов для решения ' \
         '<e2>задач морфологического синтеза</e2> и анализа текстов.'

relation = combined_extractor.extract(sample)
print(relation)
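When the terms come from a previous step rather than being typed by hand, the markers have to be inserted programmatically. A minimal sketch follows, assuming that the first term precedes the second in the text and that the two do not overlap. `mark_terms` is a hypothetical helper, not part of the repository.

```python
def mark_terms(text, first_term, second_term):
    """Wrap the first term in <e1>...</e1> and the following second term in <e2>...</e2>."""
    start1 = text.find(first_term)
    if start1 == -1:
        raise ValueError(f"term not found: {first_term!r}")
    end1 = start1 + len(first_term)
    # Search for the second term only after the first one ends.
    start2 = text.find(second_term, end1)
    if start2 == -1:
        raise ValueError(f"term not found after the first term: {second_term!r}")
    end2 = start2 + len(second_term)
    return (
        text[:start1] + "<e1>" + first_term + "</e1>"
        + text[end1:start2] + "<e2>" + second_term + "</e2>"
        + text[end2:]
    )


raw = ('Модель используется в методе генерации и определения форм слов для решения '
       'задач морфологического синтеза и анализа текстов.')
marked = mark_terms(raw, 'Модель', 'задач морфологического синтеза')
```

Here `marked` is the same string as the hand-marked `sample` in the example above.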

Entity linking

This module links terms with entities in Wikidata. It requires extracted terms and their context as input.

from entity_linker.entity_linker import RussianEntityLinker

ru_el = RussianEntityLinker()
term = 'язык программирования Python'
context = ['язык программирования Python', 'использовался', 'в']
print(ru_el.get_linked_mention(term, context))
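In the example above, the context is a list of strings in which the whole term appears as a single element. Assuming that is the expected shape (the README does not state it explicitly), the sketch below builds such a list from a tokenized sentence and the token span of the term. `build_context` is a hypothetical helper.

```python
def build_context(tokens, term_start, term_end):
    """Return (term, context), with tokens[term_start:term_end] merged into one element."""
    if not 0 <= term_start < term_end <= len(tokens):
        raise ValueError("term span is out of range")
    term = " ".join(tokens[term_start:term_end])
    # Replace the tokens of the term with the single merged string.
    context = tokens[:term_start] + [term] + tokens[term_end:]
    return term, context


tokens = ['язык', 'программирования', 'Python', 'использовался', 'в']
term, context = build_context(tokens, 0, 3)
print(term)     # язык программирования Python
print(context)  # ['язык программирования Python', 'использовался', 'в']
```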

Aspect extraction

This module extracts aspects from raw text.

from aspect_extractor import AspectExtractor

extractor = AspectExtractor()
text = "Определена модель для визуализации связей между объектами и их атрибутами в различных процессах. " \
       "На основании модели разработан универсальный абстрактный компонент графического пользовательского интерфейса и приведены примеры его программной реализации. " \
       "Также проведена апробация компонента для решения прикладной задачи по извлечению информации из документов."
result = extractor.extract(text)
for token, tag in result:
    print(f'{token} -> {tag}')

Data

RuSERRC is a dataset of Russian scientific texts annotated with terms, aspects, linked entities, and relations.

Citation

If you find this repository useful, feel free to cite our papers:

Bruches E., Tikhobaeva O., Dementyeva Y., Batura T. TERMinator: A System for Scientific Texts Processing. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022). International Committee on Computational Linguistics. 2022. pp. 3420–3426.

@inproceedings{terminator2022,
    title={{TERM}inator: A System for Scientific Texts Processing},
    author={Bruches, Elena and Tikhobaeva, Olga and Dementyeva, Yana and Batura, Tatiana},
    booktitle={Proceedings of the 29th International Conference on Computational Linguistics},
    year={2022},
    pages={3420--3426}
}

Bruches E., Mezentseva A., Batura T. A system for information extraction from scientific texts in Russian. Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2021. Communications in Computer and Information Science. Springer, Cham, 2022. vol. 1620. pp. 234–245.

@inproceedings{ruserrc,
  title={A system for information extraction from scientific texts in Russian},
  author={Bruches, Elena and Mezentseva, Anastasia and Batura, Tatiana},
  booktitle={Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2021. Communications in Computer and Information Science},
  volume={1620},
  pages={234--245},
  year={2022}
}