HIRO: Hierarchical Indexing for Retrieval-Augmented Opinion Summarization

Tom Hosking, Hao Tang and Mirella Lapata

Setup

Create a fresh environment:

conda create -n hiroenv python=3.9
conda activate hiroenv

or

python3 -m venv hiroenv
source hiroenv/bin/activate

Then install dependencies:

pip install -r requirements.txt

Download data/models:

Space -> ./data
AmaSum -> ./data
Trained checkpoints -> ./models
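
Before running any recipes, it can help to confirm the downloads landed where the commands expect them. The following is a minimal sketch, not part of the repository: it only checks that the ./data and ./models directories named above exist, and the helper name missing_dirs is made up for illustration.

```python
from pathlib import Path

# Directory names taken from the download targets listed above.
EXPECTED_DIRS = ("data", "models")


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root_path = Path(root)
    return [name for name in EXPECTED_DIRS if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Found ./data and ./models")
```

Run it from the repository root; it does not inspect the contents of either directory.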

Tested with Python 3.9.

Generating Summaries for Space/AmaSum

First, run the pre-processing eval recipe:

torchseq-eval --recipe opagg.hiro_pre --model ./models/20240130_183901_d671_space --test

Then, get generations from your preferred LLM (we used Mistral 7B Instruct v0.2), based on the prompts in the files eval/llm_inputs_piecewise_test.jsonl and eval/llm_inputs_oneshot_test.jsonl. You might want to use my TGI Client to run efficient batched inference through a HuggingFace model:

python tgi-client/runner.py --input runs/hiro/space/llm_inputs_oneshot_test.jsonl --output runs/hiro/space/llm_outputs_oneshot_test_mistaral7b.jsonl --model mistralai/Mistral-7B-Instruct-v0.2
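
Before moving on to the post-LLM recipe, a quick sanity check on the generations may save a failed run. The sketch below assumes (this is not confirmed by the repository) that the LLM output file is JSONL with one record per input prompt; it makes no assumption about field names, and both function names are hypothetical.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from err
            count += 1
    return count


def outputs_match_inputs(input_path, output_path):
    """Return True when both files hold the same number of records."""
    return count_jsonl_records(input_path) == count_jsonl_records(output_path)
```

For example, outputs_match_inputs could be called with the --input and --output paths passed to the runner above.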

Then run the post-LLM eval recipe to get the scores:

torchseq-eval --recipe opagg.hiro_post --model ./models/20240130_183901_d671_space --test

TODO: How to generate summaries for other datasets

Training on Space/AmaSum

(Optional) Rebuild the datasets:

Run the dataset filtering scripts ./scripts/opagg_filter_space.py and ./scripts/opagg_filter_space_eval.py

# SPACE
python scripts/generate_opagg_posnegtriples.py --dataset space-25toks-1pronouns --min_pos_score 0.75 --ignore_neutral --unsorted --min_overlap 0.3

# AmaSum
python scripts/generate_opagg_posnegtriples.py --dataset amasum-electronics-25toks-0pronouns --min_pos_score 0.75 --ignore_neutral --unsorted --min_overlap 0.3
python scripts/generate_opagg_posnegtriples.py --dataset amasum-shoes-25toks-0pronouns --min_pos_score 0.75 --ignore_neutral --unsorted --min_overlap 0.3
python scripts/generate_opagg_posnegtriples.py --dataset amasum-sports-outdoors-25toks-0pronouns --min_pos_score 0.75 --ignore_neutral --unsorted --min_overlap 0.3
python scripts/generate_opagg_posnegtriples.py --dataset amasum-home-kitchen-25toks-0pronouns --min_pos_score 0.75 --ignore_neutral --unsorted --min_overlap 0.3
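
The five commands above differ only in the --dataset value, so they can be generated in a loop. This is a convenience sketch rather than a script shipped with the repository; the dataset names and flags are copied from the commands above, while build_commands and run_all are illustrative names. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Dataset names copied from the commands above.
DATASETS = [
    "space-25toks-1pronouns",
    "amasum-electronics-25toks-0pronouns",
    "amasum-shoes-25toks-0pronouns",
    "amasum-sports-outdoors-25toks-0pronouns",
    "amasum-home-kitchen-25toks-0pronouns",
]

# Flags shared by every invocation above.
SHARED_FLAGS = [
    "--min_pos_score", "0.75",
    "--ignore_neutral",
    "--unsorted",
    "--min_overlap", "0.3",
]


def build_commands(datasets=DATASETS):
    """Return one argument list per dataset, using the current interpreter."""
    return [
        [sys.executable, "scripts/generate_opagg_posnegtriples.py",
         "--dataset", name, *SHARED_FLAGS]
        for name in datasets
    ]


def run_all(dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing dataset.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call run_all(dry_run=False) from the repository root to actually execute the scripts.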

Train:

torchseq --train --reload_after_train --validate --config ./configs/hiro_space.json

Training on a new dataset

Setting up to train on a new dataset can be a bit tricky. I'm happy to help run you through the process: just email me or raise an issue on GitHub.

Make a copy of your dataset in a format expected by the preprocessing scripts
Clean the training data and eval data
Generate training pairs
Modify one of the configs to point to your data
Train the model

torchseq --train --reload_after_train --validate --config ./configs/{YOUR_CONFIG}.json
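
For the "modify one of the configs" step, one option is to derive a new JSON config from an existing one instead of editing it by hand. The sketch below is generic JSON handling, not torchseq functionality: deep_update and derive_config are illustrative helpers, and the override keys in the usage comment are placeholders, since the actual key names depend on the config file you start from.

```python
import copy
import json
from pathlib import Path


def deep_update(base, overrides):
    """Return a copy of base with overrides merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def derive_config(source_path, target_path, overrides):
    """Load a JSON config, apply overrides, and write the result to target_path."""
    config = json.loads(Path(source_path).read_text(encoding="utf-8"))
    updated = deep_update(config, overrides)
    Path(target_path).write_text(json.dumps(updated, indent=4) + "\n", encoding="utf-8")
    return updated


# Example usage (the keys below are hypothetical placeholders, not real config keys):
# derive_config(
#     "configs/hiro_space.json",
#     "configs/my_dataset.json",
#     {"hypothetical_section": {"hypothetical_dataset_key": "my-dataset"}},
# )
```

Open the config you are copying to find which keys actually reference the dataset before choosing overrides.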

Citation

@misc{hosking2024hierarchical,
      title={Hierarchical Indexing for Retrieval-Augmented Opinion Summarization}, 
      author={Tom Hosking and Hao Tang and Mirella Lapata},
      year={2024},
      eprint={2403.00435},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
