IlyaGusev/summarus

summarus

Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

You can also check out the MBART-based Russian summarization model on Huggingface: mbart_ru_sum_gazeta

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

| Argument | Required | Description |
|:---------|:---------|:------------|
| -c | true | path to file with configuration |
| -s | true | path to directory where model will be saved |
| -t | true | path to train dataset |
| -v | true | path to val dataset |
| -r | false | recover from checkpoint |
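
As a rough illustration of how the flags in the table combine, here is a minimal Python sketch that assembles a train.sh invocation. The helper name `build_train_command` and all paths are hypothetical placeholders, and treating `-r` as a bare switch without a value is an assumption, not something the table confirms.

```python
import shlex
from typing import List


def build_train_command(config: str, save_dir: str, train: str, val: str,
                        recover: bool = False) -> List[str]:
    """Assemble an argument vector for train.sh from the documented flags."""
    cmd = ["./train.sh", "-c", config, "-s", save_dir, "-t", train, "-v", val]
    if recover:
        # Assumption: -r is a bare switch that takes no value.
        cmd.append("-r")
    return cmd


# Every path below is a hypothetical placeholder, not a file from the repository.
train_cmd = build_train_command(
    config="configs/pgn.json",
    save_dir="models/pgn",
    train="data/train.jsonl",
    val="data/val.jsonl",
)
print(shlex.join(train_cmd))
```

Passing the list to `subprocess.run(train_cmd, check=True)` from the repository root would then start the script; building the command as a list avoids shell quoting problems with paths that contain spaces.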

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

| Argument | Required | Default | Description |
|:---------|:---------|:--------|:------------|
| -t | true | | path to test dataset |
| -m | true | | path to tar.gz archive with model |
| -p | true | | name of Predictor |
| -c | false | 0 | CUDA device |
| -L | true | | Language ("ru" or "en") |
| -b | false | 32 | size of a batch with test examples to run simultaneously |
| -M | false | | path to meteor.jar for Meteor metric |
| -T | false | | tokenize gold and predicted summaries before metrics calculation |
| -D | false | | save temporary files with gold and predicted summaries |
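
The sketch below shows one way the required and optional predict.sh flags could be put together from Python. The function `build_predict_command` is a hypothetical helper, not part of summarus; it assumes that `-T` and `-D` are bare switches, and it leaves `-c` and `-b` out unless a value is given so that the script's own defaults (0 and 32) apply.

```python
from typing import List, Optional


def build_predict_command(test_path: str, model_archive: str, predictor: str,
                          language: str,
                          cuda_device: Optional[int] = None,
                          batch_size: Optional[int] = None,
                          meteor_jar: Optional[str] = None,
                          tokenize: bool = False,
                          keep_temp_files: bool = False) -> List[str]:
    """Assemble an argument vector for predict.sh from the documented flags."""
    if language not in ("ru", "en"):
        raise ValueError('language must be "ru" or "en"')
    cmd = ["./predict.sh", "-t", test_path, "-m", model_archive,
           "-p", predictor, "-L", language]
    if cuda_device is not None:
        cmd += ["-c", str(cuda_device)]
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if meteor_jar is not None:
        cmd += ["-M", meteor_jar]
    if tokenize:
        cmd.append("-T")  # assumption: bare switch without a value
    if keep_temp_files:
        cmd.append("-D")  # assumption: bare switch without a value
    return cmd


# "test.jsonl" is a hypothetical path; the model and predictor names are the
# ones used in the Gazeta prediction command further down in this README.
predict_cmd = build_predict_command(
    "test.jsonl", "gazeta_pgn_7kk.tar.gz", "subwords_summary", "ru", tokenize=True
)
print(" ".join(predict_cmd))
```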

summarus.util.train_subword_model

Script for subword model training.

| Argument | Default | Description |
|:---------|:--------|:------------|
| --train-path | | path to train dataset |
| --model-path | | path to directory where generated subword model will be saved |
| --model-type | bpe | type of subword model, see sentencepiece |
| --vocab-size | 50000 | size of the resulting subword model vocabulary |
| --config-path | | path to file with configuration for DatasetReader (with parse_set) |
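
Since the table points to sentencepiece for the model types, the following sketch shows how the `--model-type` and `--vocab-size` defaults might map onto a direct sentencepiece training call. This is an illustration under assumptions, not the script's actual code: the helper names are hypothetical, and the sketch expects a plain text file with one sentence per line instead of reading a dataset through a DatasetReader configuration.

```python
from typing import Any, Dict

# Model types accepted by sentencepiece.
SENTENCEPIECE_MODEL_TYPES = ("unigram", "bpe", "char", "word")


def sentencepiece_kwargs(text_path: str, model_prefix: str,
                         model_type: str = "bpe",
                         vocab_size: int = 50000) -> Dict[str, Any]:
    """Build keyword arguments for sentencepiece's trainer."""
    if model_type not in SENTENCEPIECE_MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    return {
        "input": text_path,            # plain text, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": model_type,
        "vocab_size": vocab_size,
    }


def train_sentencepiece(text_path: str, model_prefix: str, **options: Any) -> None:
    """Train a subword model; requires the sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the helper above works without it

    spm.SentencePieceTrainer.train(
        **sentencepiece_kwargs(text_path, model_prefix, **options)
    )


# Hypothetical paths, shown only to illustrate the defaults from the table.
print(sentencepiece_kwargs("train.txt", "subword/bpe"))
```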

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results

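Train dataset: RIA, test dataset: RIA

| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 40.0 | 23.3 | 37.5 | - |
| ria_pgn_24kk | 42.3 | 25.1 | 39.6 | - |
| ria_mbart | 42.8 | 25.5 | 39.9 | - |
| First Sentence | 24.1 | 10.6 | 16.7 | - |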

Train dataset: RIA, eval dataset: Lenta

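| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 25.6 | 12.3 | 23.0 | - |
| ria_pgn_24kk | 26.4 | 12.3 | 24.0 | - |
| ria_mbart | 30.3 | 14.5 | 27.1 | - |
| First Sentence | 25.5 | 11.2 | 19.2 | - |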
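
For readers unfamiliar with the column names, the sketch below shows a simplified version of what the R-1-f and R-2-f columns measure: the F-score of n-gram overlap between a reference and a prediction. It is not the evaluation code behind predict.sh; whitespace tokenization and lowercasing are simplifying assumptions, R-L (longest common subsequence) is not covered, and the tables report the score multiplied by 100. The `first_sentence` helper reflects an assumed reading of the "First Sentence" row, namely a baseline that returns the opening sentence of the text.

```python
from collections import Counter


def rouge_n_f(reference: str, prediction: str, n: int = 1) -> float:
    """Simplified ROUGE-N F-score on lowercased, whitespace-split tokens."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, pred = ngrams(reference), ngrams(prediction)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def first_sentence(text: str) -> str:
    """Naive baseline: everything up to the first period."""
    return text.split(".")[0].strip()


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(round(100 * rouge_n_f(reference, prediction, n=1), 1))
print(round(100 * rouge_n_f(reference, prediction, n=2), 1))
```

In this toy pair, five of six unigrams and three of five bigrams match, so precision, recall, and the F-score coincide at 5/6 and 3/5 respectively.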

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| cnndm_pgn_25kk | 38.5 | 16.5 | 33.4 | 17.6 | - |

Summarization - Gazeta, Russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| gazeta_pgn_7kk | 29.4 | 12.7 | 24.6 | 21.2 | 9.0 |
| gazeta_pgn_7kk_cov | 29.8 | 12.8 | 25.4 | 22.1 | 10.1 |
| gazeta_pgn_25kk | 29.6 | 12.8 | 24.6 | 21.5 | 9.3 |
| gazeta_pgn_words_13kk | 29.4 | 12.6 | 24.4 | 20.9 | 8.9 |
| gazeta_summarunner_3kk | 31.6 | 13.7 | 27.1 | 26.0 | 11.5 |
| gazeta_mbart | 32.6 | 14.6 | 28.2 | 25.7 | 12.4 |
| gazeta_mbart_lower | 32.7 | 14.7 | 28.3 | 25.8 | 12.5 |

Demo

python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>

Citations

Headline generation (PGN):

@article{Gusev2019headlines,
    author={Gusev, I.O.},
    title={Importance of copying mechanism for news headline generation},
    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
    year={2019},
    volume={2019-May},
    number={18},
    pages={229--236}
}

Headline generation (transformers):

@InProceedings{Bukhtiyarov2020headlines,
    author={Bukhtiyarov, Alexey and Gusev, Ilya},
    title="Advances of Transformer-Based Models for News Headline Generation",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages={54--61},
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_4}
}

Summarization:

@InProceedings{Gusev2020gazeta,
    author="Gusev, Ilya",
    title="Dataset for Automatic Summarization of Russian News",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages="122--134",
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_9}
}