FactorSum

Supporting code for the paper Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents (https://arxiv.org/abs/2205.12486).

Abstract

We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, performs this disentanglement by factorizing summarization into two steps through an energy function:

  1. Intrinsic importance model: generation of abstractive summary views.
  2. Extrinsic importance model: combination of these views into a final summary, following a budget and content guidance.

This extrinsic guidance may come from different sources, including an advisor model such as BART or BigBird, or, in oracle mode, from the reference summary. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation: when trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, indicating strong performance due to more flexible budget adaptation and content selection that is less dependent on domain-specific textual structure.
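To make the factorization more concrete, below is a minimal, self-contained sketch of the extrinsic step: it greedily combines pre-generated summary views into a final summary, at each step adding the view sentence that most increases token overlap with the content guidance, until the token budget is reached. The scoring (plain token overlap) and helpers are simplified stand-ins for illustration only, not the actual FactorSum implementation.

# Illustrative sketch only: plain token overlap stands in for the scoring
# used by FactorSum; see the factorsum package for the real implementation.
def combine_views(summary_views, content_guidance, budget_guidance=200):
    guidance_tokens = set(content_guidance.lower().split())
    # Candidate units: sentences taken from all generated summary views.
    candidates = [s.strip() + "." for view in summary_views
                  for s in view.split(".") if s.strip()]

    def coverage(sentences):
        # Number of guidance tokens covered by the current selection.
        return len(set(" ".join(sentences).lower().split()) & guidance_tokens)

    summary, used_tokens = [], 0
    while candidates and used_tokens < budget_guidance:
        # Greedily add the sentence that most improves guidance coverage,
        # subject to the token budget.
        best = max(candidates, key=lambda s: coverage(summary + [s]))
        candidates.remove(best)
        summary.append(best)
        used_tokens += len(best.split())
    return " ".join(summary)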

Getting started

Clone this repository and install the dependencies:

git clone https://github.com/thefonseca/factorsum.git
cd factorsum
# Optional: checkout the arXiv version 2205.12486v2 for reproducibility
git checkout 2205.12486v2
# Install dependencies
pip install -r requirements.txt

Usage

Example: summarizing a single document using budget guidance and source content guidance.

from factorsum.model import FactorSum

training_domain = 'arxiv'
budget_guidance = 200  # budget guidance in tokens

model = FactorSum(training_domain)
summary = model.summarize(
    document,  # a document string
    budget_guidance=budget_guidance,
    source_token_budget=budget_guidance,  # number of tokens to use from the source document as content guidance
    verbose=True,
)
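
Since the budget is an inference-time argument, the same FactorSum instance can produce summaries at different lengths without retraining. A small follow-up example, reusing the model and document from the snippet above:

# Compare summaries generated under different budget guidance values.
for budget in (100, 200, 300):
    summary = model.summarize(
        document,
        budget_guidance=budget,
        source_token_budget=budget,
    )
    print(f'--- budget guidance: {budget} tokens ---')
    print(summary)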

A command-line tool is provided to explore summary samples and parameters. For instance, to see the summary for sample 230 from the arXiv test set, use the following command (GPU recommended):

python -m factorsum.model --doc_id 230 --dataset_name arxiv --split test \
--budget_guidance=200 --content_guidance_type source

It will output the target abstract, the generated summary, and the evaluation scores.

Colab Playground

A Colab notebook is available for summary generation.


Reproducing the evaluation results

The evaluation procedure relies on the following data:

  • The arXiv, PubMed, and GovReport summarization datasets.
  • The document views dataset generated by the sampling procedure (refer to Section 2.1 "Sampling Document Views" in the paper).
  • The summary views predicted from the document views (see Section 2.1.1 in the paper).

For convenience, we provide all the preprocessed resources, which can be downloaded using this command:

python -m factorsum.data download

Alternatively, you can use the instructions below to prepare the resources from scratch.

Prepare data from scratch

Preprocess the summarization datasets (test splits):

python -m factorsum.data prepare_dataset scientific_papers arxiv --split test

python -m factorsum.data prepare_dataset scientific_papers pubmed --split test

python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport --split test

Then generate the document views for each dataset:

python -m factorsum.data prepare_dataset scientific_papers arxiv --split test --sample_type random --sample_factor 5 --views_per_doc 20

python -m factorsum.data prepare_dataset scientific_papers pubmed --split test --sample_type random --sample_factor 5 --views_per_doc 20

python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport --split test --sample_type random --sample_factor 5 --views_per_doc 20

Download the intrinsic importance model checkpoints:

python -m factorsum.utils download_models --model_dir ./artifacts

The checkpoints are saved to the ./artifacts folder; for example, artifacts/model-rs86h5g0:v0 is the intrinsic importance model trained on arXiv.

Finally, generate summary views using the run_summarization.py script (slightly adapted from the original Hugging Face script). The following command generates summary views for the arXiv test set using the model checkpoint in artifacts/model-rs86h5g0:v0:

MODEL_PATH='artifacts/model-rs86h5g0:v0' \
DATASET='arxiv' SPLIT='test' \
python scripts/run_summarization.py \
    --model_name_or_path "${MODEL_PATH}" \
    --do_predict \
    --output_dir output/"${DATASET}-${SPLIT}-summary_views" \
    --per_device_eval_batch_size=8 \
    --overwrite_output_dir \
    --predict_with_generate \
    --validation_file "data/${DATASET}-random_k_5_samples_20_${SPLIT}.csv" \
    --test_file "data/${DATASET}-random_k_5_samples_20_${SPLIT}.csv" \
    --text_column source \
    --summary_column target \
    --generation_max_length 128 \
    --generation_num_beams 4

This will generate a generated_predictions.pkl file in the output_dir folder. To use the summary views, this file has to be copied to the data folder following this naming convention:

cp output/"${DATASET}-${SPLIT}-summary_views/generated_predictions.pkl" data/"${DATASET}-${SPLIT}-summary_views-bart-${TRAINING_DOMAIN}-run=${RUN_ID}.pkl"

For instance, for the arXiv test set in-domain summary views we would have:

cp output/arxiv-test-summary_views/generated_predictions.pkl data/arxiv-test-summary_views-bart-arxiv-run=rs86h5g0.pkl

To generate summary views in a cross-domain setting, set the MODEL_PATH, DATASET, TRAINING_DOMAIN, and RUN_ID variables accordingly. For example, summary views for the arXiv test set generated by a model trained on PubMed would be saved as data/arxiv-test-summary_views-bart-pubmed-run=<RUN_ID>.pkl, where <RUN_ID> identifies the PubMed checkpoint.

Hyperparameters

Refer to the file config.py for hyperparameter definitions.

In-domain evaluation

The in-domain summarization results in Table 2 in the paper are obtained with the following command:

python -m evaluation.factorsum evaluate --dataset_name arxiv --split test --output_dir output

where dataset_name is arxiv, pubmed, or govreport. By default, scores and summary predictions are saved to the ./output folder.

Cross-domain evaluation

These results correspond to Table 3 of the paper. To specify the training domain of the intrinsic model, use the training_domain option. The following example performs cross-domain evaluation on the arXiv dataset, using summary views generated by a model trained on PubMed:

python -m evaluation.factorsum evaluate --dataset_name arxiv --split test --training_domain pubmed

Varying budget guidance

Results for the experiments with varying budget guidance (Appendix D in the paper) are obtained with the following command:

python -m evaluation.budgets --dataset_name <dataset_name> --split test

where dataset_name is arxiv, pubmed, or govreport.

Baselines

PEGASUS predictions:

python scripts/run_summarization.py \
    --model_name_or_path google/pegasus-arxiv \
    --do_predict \
    --output_dir /output \
    --per_device_eval_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --generation_max_length 256 \
    --generation_num_beams 8 \
    --val_max_target_length 256 \
    --max_source_length 1024 \
    --dataset_name scientific_papers \
    --dataset_config arxiv \
    --predict_split test

BigBird predictions:

python scripts/run_summarization.py \
    --model_name_or_path google/bigbird-pegasus-large-arxiv \
    --do_predict \
    --output_dir /content/output \
    --per_device_eval_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --report_to none \
    --generation_max_length 256 \
    --generation_num_beams 5 \
    --val_max_target_length 256 \
    --max_source_length 3072 \
    --dataset_name scientific_papers \
    --dataset_config arxiv \
    --predict_split test

Training the intrinsic importance model

First, make sure the data for all splits is available (processing the training sets might take several minutes):

python -m factorsum.data prepare_dataset scientific_papers arxiv
python -m factorsum.data prepare_dataset scientific_papers pubmed
python -m factorsum.data prepare_dataset ccdv/govreport-summarization govreport

Then run the training script as follows:

DATASET='arxiv' \
python scripts/run_summarization.py \
    --model_name_or_path facebook/bart-base \
    --do_train \
    --do_eval \
    --do_predict \
    --output_dir output/"${DATASET}"-k_5_samples_20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --predict_with_generate \
    --gradient_accumulation_steps 4 \
    --generation_max_length 128 \
    --generation_num_beams 4 \
    --val_max_target_length 128 \
    --max_source_length 1024 \
    --max_target_length 128 \
    --fp16 \
    --save_total_limit 2 \
    --save_strategy steps \
    --evaluation_strategy steps \
    --save_steps 5000 \
    --eval_steps 5000 \
    --max_steps 50000 \
    --learning_rate 5e-5 \
    --report_to wandb \
    --metric_for_best_model eval_rouge1_fmeasure \
    --load_best_model_at_end \
    --max_train_samples 4000000 \
    --max_eval_samples 10000 \
    --max_predict_samples 10000 \
    --train_file data/"${DATASET}"-random_k_5_samples_20_train.csv \
    --validation_file data/"${DATASET}"-random_k_5_samples_20_validation.csv \
    --test_file data/"${DATASET}"-random_k_5_samples_20_test.csv \
    --text_column source \
    --summary_column target \
    --seed 17

Note: to use mixed precision (--fp16) you need a compatible CUDA device.

Citation

@inproceedings{fonseca2022factorizing,
 author = {Fonseca, Marcio and Ziser, Yftah and Cohen, Shay B.},
 booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
 location = {Abu Dhabi},
 publisher = {Association for Computational Linguistics},
 title = {Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents},
 year = {2022}
}
