ELSA: Extractive Linking of Summarization Approaches

ELSA combines extractive and abstractive approaches to automatic text summarization.

Authors: Maksim Eremeev (mae9785@nyu.edu), Mars Wei-Lun Huang (wh2103@nyu.edu), Eric Spector (ejs618@nyu.edu), Jeffrey Tumminia (jt2565@nyu.edu)

Installation

python setup.py build
pip install .

Quick Start with ELSA

from elsa import Elsa

article = '''some text...
'''

# Generation parameters forwarded to the underlying HuggingFace generate() call
abstractive_model_params = {
    'num_beams': 10,
    'max_length': 300,
    'min_length': 55,
    'no_repeat_ngram_size': 3
}

# Weight the TextRank and Centroid extractive summarizers equally, then rewrite
# the extractive summary with a BART model trained on CNN-DailyMail
elsa = Elsa(weights=[1, 1], abstractive_base_model='bart', base_dataset='cnn', stopwords='data/stopwords.txt',
            fasttext_model_path='datasets/cnn/elsa-fasttext-cnn.bin',
            udpipe_model_path='data/english-ewt-ud-2.5-191206.udpipe')

summary = elsa.summarize(article, **abstractive_model_params)

__init__ parameters

  • weights: List[float] -- weights for the TextRank and Centroid extractive summarizers.
  • abstractive_base_model: str -- model used at the abstractive step. Either 'bart' or 'pegasus'.
  • base_dataset: str -- dataset used to train the abstractive model. Either 'cnn' or 'xsum' (see the example after this list).
  • stopwords: str -- path to the list of stopwords.
  • fasttext_model_path: str -- path to the *.bin checkpoint of a trained FastText model (see below for training instructions).
  • udpipe_model_path: str -- path to the *.udpipe checkpoint of the pretrained UDPipe model (see the data directory for the files).
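For example, a hypothetical configuration for a PEGASUS model trained on XSum might look like this (the FastText checkpoint path is illustrative and must point to a file you have downloaded or trained):

elsa_xsum = Elsa(weights=[1, 1], abstractive_base_model='pegasus', base_dataset='xsum',
                 stopwords='data/stopwords.txt',
                 fasttext_model_path='datasets/xsum/elsa-fasttext-xsum.bin',  # hypothetical path
                 udpipe_model_path='data/english-ewt-ud-2.5-191206.udpipe')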

summarize parameters

  • factor: float -- fraction (a number from 0 to 1) of sentences to keep in the extractive summary (default: 0.5)

  • use_lemm: bool -- whether to use lemmatization during preprocessing (default: False)

  • use_stem: bool -- whether to use stemming during preprocessing (default: False)

  • check_stopwords: bool -- whether to filter out stopwords during preprocessing (default: True)

  • check_length: bool -- whether to filter out tokens shorter than 4 characters (default: True)

  • abstractive_model_params: dict -- any parameters accepted by the HuggingFace model's generate method (see the sketch after this list)
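A minimal sketch of a call that tunes both stages, reusing the elsa and article objects from the Quick Start (all parameter values are illustrative):

summary = elsa.summarize(
    article,
    factor=0.3,      # keep 30% of the sentences at the extractive step
    use_lemm=True,   # lemmatize tokens during preprocessing
    num_beams=5,     # forwarded to the HuggingFace generate() method
    max_length=150,  # forwarded to generate() as well
)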

Datasets used for experiments

CNN-DailyMail: Link, original source: Link

XSum: Link, original source: Link

Gazeta.RU: Link, original source: Link

Downloading & Extracting datasets

wget https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz
wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
wget https://www.dropbox.com/s/cmpfvzxdknkeal4/gazeta_jsonl.tar.gz

tar -xzf cnndm.tar.gz
tar -xzf XSUM-EMNLP18-Summary-Data-Original.tar.gz
tar -xzf gazeta_jsonl.tar.gz

FastText models

Our trained FastText models

CNN-DailyMail: Link

XSum: Link

Gazeta: Link

See our FastText page for training details.
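The exact preprocessing and hyperparameters are described on that page; as a rough sketch, training an unsupervised model with the official fasttext package could look like this (the corpus path and hyperparameters here are assumptions):

import fasttext

# Train a skip-gram FastText model on a preprocessed plain-text corpus
# (one article per line; path and hyperparameters are illustrative)
model = fasttext.train_unsupervised('datasets/cnn/train.txt', model='skipgram', dim=100)

# Save the *.bin checkpoint in the format expected by Elsa's fasttext_model_path
model.save_model('datasets/cnn/elsa-fasttext-cnn.bin')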

UDPipe models

UDPipe models available for English:

  • UDPipe-English EWT: Link (used in our experiments, see the data directory)
  • UDPipe-English ParTUT: Link
  • UDPipe-English LinES: Link
  • UDPipe-English GUM: Link

Other UDPipe models: Link

Adaptation for Russian

Since the approach we use for ELSA is language-independent, it is easy to adapt to other languages. For Russian, we fine-tune mBART on the Gazeta dataset, train an additional FastText model, and use a UDPipe model built for Russian texts.

UDPipe models for Russian

  • UDPipe-Russian SynTagRus: Link
  • UDPipe-Russian GSD: Link (used in our experiments, see the data directory)
  • UDPipe-Russian Taiga: Link

mBART checkpoint

HuggingFace checkpoint: Link
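Putting the pieces together, a hypothetical Russian configuration with the Elsa class from the Quick Start might look like the following (the 'mbart' and 'gazeta' identifiers and all file paths are assumptions; check the constructor's accepted values):

elsa_ru = Elsa(weights=[1, 1],
               abstractive_base_model='mbart',    # assumed identifier for the mBART checkpoint
               base_dataset='gazeta',             # assumed identifier for the Gazeta dataset
               stopwords='data/stopwords_ru.txt',                               # hypothetical Russian stopword list
               fasttext_model_path='datasets/gazeta/elsa-fasttext-gazeta.bin',  # hypothetical path
               udpipe_model_path='data/russian-gsd-ud-2.5.udpipe')              # hypothetical path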

Codestyle check

Before making a commit or pull request, please check the coding style by running the bash script in the codestyle directory. Make sure that your folder is included in the codestyle/pycodestyle_files.txt list.

Your changes will not be approved if the script reports any violations (this does not apply to third-party code).

Usage:

cd codestyle
sh check_code_style.sh