abrazinskas/Copycat-abstractive-opinion-summarizer

Unsupervised Opinion Summarization as Copycat-Review Generation

This repository contains the Python (PyTorch) codebase of the corresponding paper accepted at ACL 2020, Seattle, USA.

The model is fully unsupervised and is trained on a large corpus of customer reviews, such as Yelp or Amazon. It generates abstractive summaries condensing common opinions across a group of reviews. It relies on Bayesian auto-encoding that fosters learning rich hierarchical semantic representations of reviews and products. Finally, the model uses a copy mechanism to better preserve details of input reviews.

Example summaries produced by the system are shown below.

  • This restaurant is a hidden gem in Toronto. The food is delicious, and the service is impeccable. Highly recommend for anyone who likes French bistro.

  • This is a great case for the Acer Aspire 14" laptop. It is a little snug for my laptop, but it's a nice case. I would recommend it to anyone who wants to protect their laptop.

  • This is the best steamer I have ever owned. It is easy to use and easy to clean. I have used it several times and it works great. I would recommend it to anyone looking for a steamer.

For more examples, please refer to the artifacts folder.

Installation

The easiest way to proceed is to create a separate conda environment.

conda create -n copycat python=3.6.9
conda activate copycat

Install required modules.

pip install -r requirements.txt

Add the root directory to the path.

export PYTHONPATH=root_path:$PYTHONPATH
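
After setting the path, a quick check can confirm that the interpreter can locate the packages. The snippet below is a minimal sketch rather than part of the repository; the helper name `missing_modules` is hypothetical, and it assumes `copycat` and `mltoolkit` are the top-level package names.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed top-level package names; adjust if your checkout differs.
    missing = missing_modules(["copycat", "mltoolkit"])
    print("missing:", missing if missing else "none")
```

If a package is reported as missing, re-check that root_path in the export command points at the repository root.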

Data

Our model is trained on two collections of customer reviews: Amazon and Yelp. It was evaluated on human-annotated summaries based on both datasets.

Unsupervised data

The Yelp and Amazon datasets must be preprocessed and placed in the /data folder. See the instructions in the preprocessing folder.

Input Data Format

To train on a different dataset, follow the input format illustrated in artifacts. The reviews of each business/product must be stored in a separate CSV file, where each line corresponds to one review.

group_id review_text rating category
159985130X We recommend the Magnifier ... 4.0 health_and_personal_care

The rating column is optional as it is not used by the model.
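
As an illustration of this layout, the following sketch splits a flat list of reviews into one file per group. It is not the repository's preprocessing code: the function name `write_group_files`, the tab separator, and the `<group_id>.csv` naming scheme are assumptions, so compare the output with the example in artifacts before training.

```python
import csv
import os
from collections import defaultdict


def write_group_files(reviews, out_dir, sep="\t"):
    """Write one CSV file per group_id; each row is one review.

    `reviews` is an iterable of dicts with the keys group_id, review_text,
    rating, and category. The separator and the file naming scheme are
    assumptions made for this sketch.
    """
    columns = ["group_id", "review_text", "rating", "category"]
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["group_id"]].append(review)

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for group_id, rows in grouped.items():
        path = os.path.join(out_dir, f"{group_id}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=columns, delimiter=sep)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```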

Evaluation Summaries

Evaluation can be performed on human-created summaries; both the Amazon and Yelp summaries are publicly available. No preprocessing is needed for evaluation. We created the Amazon summaries using the Mechanical Turk platform; more information on the process can be found in the corresponding folder.

Running

If you preprocessed data yourself, please create your vocabulary and truecaser. Otherwise, you can skip the following two sections.

Vocabulary Creation

The vocabulary file contains a mapping from words to their frequencies, where a word's position in the file corresponds to the id used by the model.

python copycat/scripts/create_vocabulary.py --data_path=your_data_path --vocab_fp=data/dataset_name/vocabs/vocab.txt
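
To make the "position equals id" idea concrete, here is a minimal sketch of such a vocabulary file. It does not reproduce create_vocabulary.py: the whitespace tokenization, the `word count` line layout, and all function names are assumptions, and the actual script may reserve ids for special symbols.

```python
from collections import Counter


def build_vocab(texts, min_freq=1):
    """Count whitespace-separated, lowercased tokens and sort by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Descending frequency, ties broken alphabetically, so ids are deterministic.
    items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count) for word, count in items if count >= min_freq]


def write_vocab(vocab, path):
    """Write one 'word count' pair per line (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in vocab:
            f.write(f"{word} {count}\n")


def read_word_ids(path):
    """Use the 0-based line number of each word as its id."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0]: idx for idx, line in enumerate(f)}
```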

Truecaser Creation

The truecaser restores the original letter casing of lowercased text. It needs to be trained (quickly) by scanning the dataset. Note that multiple folders can be assigned to the data_path parameter.

python copycat/scripts/train_truecaser.py --data_path=your_data_path --tcaser_fp=data/dataset_name/tcaser.model
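
For intuition, a truecaser can be as simple as remembering the most frequent surface form of every word. The sketch below illustrates that idea only; it is not the model produced by train_truecaser.py, and the function names and the one-sentence-per-string input are assumptions.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lowercased token to its most frequent surface form.

    Sentence-initial tokens are skipped because their capitalization is
    positional rather than a property of the word.
    """
    forms = defaultdict(Counter)
    for sentence in sentences:
        for position, token in enumerate(sentence.split()):
            if position == 0:
                continue
            forms[token.lower()][token] += 1
    return {lower: counter.most_common(1)[0][0] for lower, counter in forms.items()}


def truecase(text, model):
    """Restore the casing of a lowercased sentence and capitalize its first token."""
    tokens = [model.get(token, token) for token in text.split()]
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)
```

Words never seen during training are left unchanged, which is why the scan should cover the same data the model is trained on.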

Workflow

Workflow parameters are set in copycat/hparams/run_hp.py, for example the data paths or the number of training epochs.

The file run_workflow.py contains a workflow of operations that are executed to prepare necessary objects (e.g., beam search) and then run a training and/or evaluation procedure. After adjusting run parameters, execute the following command.

python copycat/scripts/run_workflow.py

Summary generation

Summaries can also be generated from CSV files via the run_workflow.py file. The input must follow the CSV format of copycat/amazon/data/infer_input.csv. The review columns must be named 'rev1', ..., 'revN', and tab must be used as the separator.

python copycat/scripts/run_workflow.py --infer-input-file-path=your_csv_input_file_path --infer-batch-size=20
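
A file with this column layout could be produced along the following lines. This is a sketch under assumptions: the function name `write_infer_input` is hypothetical, every group is assumed to hold the same number of reviews, and the example file may contain further columns, so check the result against copycat/amazon/data/infer_input.csv.

```python
import csv


def write_infer_input(review_groups, path):
    """Write one row per group with tab-separated columns rev1..revN.

    `review_groups` is a list of lists of review strings; all groups are
    assumed to contain the same number of reviews.
    """
    n_reviews = len(review_groups[0])
    if any(len(group) != n_reviews for group in review_groups):
        raise ValueError("every group must contain the same number of reviews")
    header = [f"rev{i}" for i in range(1, n_reviews + 1)]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(review_groups)
    return header
```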

Checkpoints

Amazon and Yelp checkpoints are available for download. Please place them in copycat/artifacts/, in the corresponding dataset sub-folders.

LICENSE

MIT

Citation

@inproceedings{brazinskas2020-unsupervised,
    title = "Unsupervised Opinion Summarization as Copycat-Review Generation",
    author = "Bra{\v{z}}inskas, Arthur  and
      Lapata, Mirella  and
      Titov, Ivan",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.461",
    doi = "10.18653/v1/2020.acl-main.461",
    pages = "5151--5169"
}

Notes

  • Minor deviations from the published results are expected as the code was migrated from a bleeding-edge PyTorch version and Python 2.7.

  • Post factum, we added a beam search generator that has the n-gram blocking functionality (based on OpenNMT). The enhancement reduces repetitions.

  • The setup was fully tested with Python 3.6.9.

  • The model works on a single GPU only.

  • mltoolkit provides the backbone functionality for data processing and modelling. Make sure it's visible to the interpreter.
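
The n-gram blocking mentioned in the notes above can be illustrated with a small, framework-free sketch. It is neither the OpenNMT code nor this repository's generator: it operates on plain token lists, the function names are hypothetical, and scores are kept in a dict for readability. The idea is to forbid any next token that would complete an n-gram already present in the hypothesis.

```python
def blocked_next_tokens(prefix, n):
    """Return the tokens that would repeat an n-gram already in `prefix`."""
    if n < 1:
        raise ValueError("n must be positive")
    # The last n-1 generated tokens form the context of the next n-gram.
    context = tuple(prefix[len(prefix) - (n - 1):]) if n > 1 else ()
    if len(context) < n - 1:
        return set()  # the hypothesis is still too short to repeat anything
    blocked = set()
    for start in range(len(prefix) - n + 1):
        ngram = tuple(prefix[start:start + n])
        if ngram[:-1] == context:
            blocked.add(ngram[-1])
    return blocked


def apply_ngram_blocking(log_probs, prefix, n):
    """Set the score of blocked continuations to -inf for one hypothesis."""
    blocked = blocked_next_tokens(prefix, n)
    return {
        token: (float("-inf") if token in blocked else score)
        for token, score in log_probs.items()
    }
```

For instance, with the hypothesis "it is easy to use and it is" and n = 3, the token "easy" is blocked because "it is easy" was already generated.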