manishb89/interpretable_sentence_similarity
Implementation of our IJCAI 2020 paper, Logic Constrained Pointer Networks for Interpretable Textual Similarity.

Citation

@inproceedings{ijcai2020-333,
  title     = {Logic Constrained Pointer Networks for Interpretable Textual Similarity},
  author    = {Maji, Subhadeep and Kumar, Rohan and Bansal, Manish and Roy, Kalyani and Goyal, Pawan},
  booktitle = {Proceedings of the Twenty-Ninth International Joint Conference on
               Artificial Intelligence, {IJCAI-20}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  editor    = {Christian Bessiere},
  pages     = {2405--2411},
  year      = {2020},
  month     = {7},
  note      = {Main track},
  doi       = {10.24963/ijcai.2020/333},
  url       = {https://doi.org/10.24963/ijcai.2020/333},
}

Prerequisites

pytorch
pytorch-pretrained-bert
gensim
numpy
spacy
requests
tqdm
nltk
lxml

Setup/Installation

SemEval-2016 iSTS task dataset (headlines and images):

  • Download the train/test sets and unzip them under the datasets/sts_16 directory.

Spacy:

pip install spacy
python -m spacy download en

pytorch-pretrained-bert:

from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Note: Downloading the pre-trained BertModel (bert-base-uncased) may take a while depending on your internet connection.

Steps to reproduce

Note: The following scripts default to the headlines dataset but can also be run on the images dataset.

1. Generate BERT embeddings

python src/corpus/chunk_embedding.py --gold_alignments STSint.input.headlines.wa \
                                     --left_chunks STSint.input.headlines.sent1.chunk.txt \
                                     --right_chunks STSint.input.headlines.sent2.chunk.txt \
                                     --output_file bert_base_uncased_input_headlines_1536._emb.bin

Note: Generate embeddings for both train and test separately.
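chunk_embedding.py presumably maps each chunk to a fixed-size vector. One common recipe (an assumption here, not read from the script) is to mean-pool the BERT subword vectors of a chunk; concatenating a pair of 768-dim pooled vectors would match the 1536 in the output filename above:

```python
# Hedged sketch: how a fixed-size chunk embedding might be formed from
# BERT subword vectors. Helper names are illustrative, not from the repo;
# random vectors stand in for real BERT outputs.
import numpy as np

def pool_chunk(subword_vecs):
    """Mean-pool the subword vectors of one chunk into a single vector."""
    return np.mean(subword_vecs, axis=0)

rng = np.random.default_rng(0)
left = pool_chunk(rng.normal(size=(3, 768)))   # chunk with 3 subwords
right = pool_chunk(rng.normal(size=(5, 768)))  # chunk with 5 subwords

# Concatenating two 768-dim vectors gives the 1536 dimensions suggested
# by the output filename (an assumption, not confirmed from the code).
pair = np.concatenate([left, right])
print(pair.shape)  # (1536,)
```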

2. Generate resource files for FOL constraints

cd src
# Create ConceptNet cache of related chunks from all sentences
python scripts/create_cn_cache.py --data_dir ../datasets/sts_16

# Create mapping resource file from left & right sentences according to ConceptNet relations
python scripts/create_constr_map.py --data_dir ../datasets/sts_16
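The cache script presumably queries ConceptNet for terms related to each chunk. The public API (http://api.conceptnet.io/c/en/&lt;term&gt;) returns a JSON object with an "edges" list; a minimal sketch of extracting related terms from such a response (the sample below is hand-made, and the actual script's parsing may differ) could look like:

```python
# Hedged sketch of the ConceptNet lookup the cache script presumably performs.
# `related_terms` parses the "edges" list of a ConceptNet API response;
# the sample response here is hand-made, not fetched over the network.
def related_terms(response_json, term):
    """Collect the labels on the other side of each edge touching `term`."""
    related = set()
    for edge in response_json.get("edges", []):
        for node in (edge.get("start", {}), edge.get("end", {})):
            label = node.get("label", "").lower()
            if label and label != term:
                related.add(label)
    return related

sample = {"edges": [
    {"start": {"label": "car"}, "end": {"label": "vehicle"}},
    {"start": {"label": "automobile"}, "end": {"label": "car"}},
]}
print(sorted(related_terms(sample, "car")))  # ['automobile', 'vehicle']
```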

3. Train and Evaluate model

Set all resource paths (e.g. the iSTS chunk dataset files, FOL constraint resources, and BERT embeddings generated per the instructions above) in training/configuration.py. Other hyperparameters can also be controlled via configuration.py:

  • output_constr can be "C1", or "" to disable structured knowledge constraints (R1)
  • syn_scores is a boolean that enables/disables syntactic constraints (R2)
  • rho sets the relative importance of the constraints
  • gpuid sets the GPU id for pytorch
  • max_epoch controls the number of training epochs
  • patience sets the patience for early stopping
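The options above might map to configuration variables roughly as follows (an illustrative fragment; the exact names, defaults, and paths in training/configuration.py may differ):

```python
# Illustrative configuration.py fragment; values are examples only.
output_constr = "C1"   # "" disables structured knowledge constraints (R1)
syn_scores = True      # enable/disable syntactic constraints (R2)
rho = 1.0              # relative importance of the constraints
gpuid = 0              # GPU id for pytorch
max_epoch = 30         # number of training epochs
patience = 5           # early stopping patience
```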

At the end of training, train.py saves the best model checkpoint to the model.checkpoint file and evaluates F1-score on the test set.
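For intuition, alignment F1 can be computed as precision/recall over predicted vs. gold chunk-alignment pairs. The sketch below shows the unweighted case; the iSTS evaluation (and this repo's scorer) may additionally weight alignments by type or score:

```python
# Hedged sketch of unweighted alignment F1 over chunk-index pairs.
def alignment_f1(predicted, gold):
    """F1 of predicted vs. gold alignment pairs, e.g. {(1, 1), (2, 3)}."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # correctly predicted alignments
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = {(1, 1), (2, 3), (4, 4)}
gold = {(1, 1), (2, 2), (4, 4)}
print(round(alignment_f1(pred, gold), 3))  # 0.667
```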

Sample commands:

# Run with default settings, i.e. without constraints
python training/train.py

# Enable constraint
# Default resource is cn_combined_unigram_content_only.json in respective train/test path
python training/train.py --constraint

# If constraint resource needs to be changed
python training/train.py --constraint --resource cn_combined_bigram.json

# To run on another dataset
python training/train.py --dataset_type image

# To change hidden dimension in pointer network
python training/train.py --hidden_dim 150 --constraint

# Change rho
python training/train.py --hidden_dim 150 --constraint --rho 2.0
