
similarity

Bilingual sentence similarity classifier based on optimising word alignments, using TensorFlow.

This repo implements a sentence similarity classifier model using TensorFlow. Similarity classification is based on the ideas introduced by Carpuat et al., 2017, and is similar to Vyas et al., 2018, Schwenk, 2018 and Grégoire et al., 2018. The code borrows many of the concepts and the architecture presented in Legrand et al., 2016.

Details on the implementation and experiments are published in:

The next picture shows an example of similarity classification for the sentence pair:

What do you feel ? Not . ||| Que ressentez-vous ?

As can be seen, the model outputs:

  • a matrix with alignment scores,
  • word aggregation scores (shown next to each source/target word) and
  • an overall sentence pair similarity score (+0.1201).

In the previous paper we show that divergent sentences can be filtered out (using the sentence similarity score) and that some divergences can be fixed (following the alignment scores), in both cases leading to higher accuracy than neural MT systems trained on the original corpora. For our experiments we used the English-French OpenSubtitles and the English-German Paracrawl corpora.

Installation

pip install -r requirements.txt

A Docker image integrating all requirements can also be built with:

docker build -t systran/similarity -f Dockerfile .

Preprocess

Before learning the similarity model, it is best to preprocess the training data with a tokenisation toolkit, mainly to reduce the vocabulary size. Any subtokenisation toolkit (such as BPE) can also be used. In our experiments we used the default tokenisation scheme implemented in OpenNMT, performing minimal tokenisation without subtokenisation. Any OpenNMT tokenisation can also be performed on the fly on the input data, given a JSON configuration file.
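For illustration, here is a minimal sketch of such a tokenisation step using the pyonmttok package (the Python bindings of the OpenNMT Tokenizer); the tokenisation mode and the file names are placeholders, not the exact settings used in our experiments:

# Minimal tokenisation sketch using pyonmttok (OpenNMT Tokenizer bindings).
# The "conservative" mode and the file names are illustrative placeholders.
import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative")

with open("train.en") as fin, open("train.en.tok", "w") as fout:
    for line in fin:
        tokens, _ = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")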

Vocabularies

After tokenisation, the most frequent |Vs| source and |Vt| target words are considered to be part of the source and target vocabularies respectively. The remaining words are mapped to a special UNK token. In our experiments we used |Vs| = |Vt| = 50,000 words.
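Such a vocabulary can be built with a few lines of Python. The sketch below keeps the 50,000 most frequent tokens of a tokenised file; the file names are placeholders and the exact vocabulary file format expected by the scripts may differ:

# Keep the 50,000 most frequent tokens of a tokenised corpus;
# all remaining tokens will be mapped to the special UNK entry.
from collections import Counter

counts = Counter()
with open("train.en.tok") as fin:
    for line in fin:
        counts.update(line.split())

with open("vocab.en", "w") as fout:
    for word, _ in counts.most_common(50000):
        fout.write(word + "\n")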

Pre-trained word embeddings

Any initialisation of the source and target word embeddings can be used. In our experiments we initialised both source and target embeddings using fastText with |Es| = |Et| = 256 cells. The embeddings were further refined using MUSE. Note that pre-trained word embeddings are not required to learn the similarity model.
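As an illustration, the sketch below loads pre-trained vectors in the textual .vec format produced by fastText (or refined with MUSE) into a matrix ordered like the vocabulary; the file names are placeholders and out-of-vocabulary words receive a random initialisation:

# Build an embedding matrix aligned with the vocabulary from a .vec file
# (first line: "num_words dim", then one "word v1 ... vE" entry per line).
# File names and the random initialisation range are placeholders.
import numpy as np

dim = 256
vocab = [w.strip() for w in open("vocab.en")]

vectors = {}
with open("wiki.en.vec") as fin:
    next(fin)  # skip the "num_words dim" header line
    for line in fin:
        fields = line.rstrip().split(" ")
        vectors[fields[0]] = np.asarray(fields[1:], dtype=np.float32)

emb = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype(np.float32)
for i, word in enumerate(vocab):
    if word in vectors:
        emb[i] = vectors[word]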

Word alignments and part-of-speech tags

To generate some of the training examples we need word alignments and POS tags for the source sentences. In our experiments we used fast_align and FreeLing to perform word alignment and English POS tagging respectively. Note that neither word alignments nor POS tags are required to learn the similarity model.
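fast_align expects one sentence pair per line in the form source ||| target and outputs Pharaoh-style alignments such as 0-0 1-1. A small sketch preparing that input from two tokenised files (file names are placeholders):

# Convert two parallel tokenised files into the "src ||| tgt" format
# expected by fast_align. File names are placeholders.
with open("train.en.tok") as fsrc, open("train.fr.tok") as ftgt, \
     open("train.enfr", "w") as fout:
    for src, tgt in zip(fsrc, ftgt):
        fout.write(src.strip() + " ||| " + tgt.strip() + "\n")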

Once the parallel training corpus is preprocessed, we are ready to prepare the training examples:

python -u src/build_data.py
*  -data          FILE : training data
   -mode        STRING : how data examples are generated (p: parallel, u: uneven, i: insert, r: replace, d: delete) [p]
   -replace       FILE : equivalent sequences (needed when -mode contains r)
   -seq_size       INT : sentences larger than this number of src/tgt words are filtered out [50]
   -max_sents      INT : Consider this number of sentences per batch (0 for all) [0]
   -seed           INT : seed for randomness [1234]
   -shuffle            : shuffle data
   -debug              : debug mode
   -h                  : this help

+ Options marked with * must be set. The rest have default values.

The input data file contains one sentence pair per line, with the following fields separated by TABs:

  • source sentence
  • target sentence
  • source/target alignments
  • source part-of-speech tags

For instance:

Why wait for the Euro ?   Pourquoi attendre l' Euro ?   0-0 1-1 2-1 3-2 4-3 5-4   WRB VB IN DT NNP .

Alternatively, the -data path can be multiple files separated by commas.

(The last two fields are optional)
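For clarity, here is a small sketch of how one such TAB-separated line can be split into its fields, assuming the format shown above (the function name is ours, not part of the repository):

# Split one training line into its TAB-separated fields;
# the alignment and POS fields may be absent.
def parse_line(line):
    fields = line.rstrip("\n").split("\t")
    src = fields[0].split()                                  # source words
    tgt = fields[1].split()                                  # target words
    align = fields[2].split() if len(fields) > 2 else None   # e.g. ["0-0", "1-1"]
    pos = fields[3].split() if len(fields) > 3 else None     # source POS tags
    return src, tgt, align, pos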

Available modes:

  • 'p': parallel sentences
Why wait for the Euro ?   Pourquoi attendre l' Euro ?   -1.0 -1.0 -1.0 -1.0 -1.0 -1.0   -1.0 -1.0 -1.0 -1.0 -1.0

Parallel sentences from the bitext.

  • 'u': uneven sentences
Why wait for the Euro ?   Cela peut donc se produire .   1.0 1.0 1.0 1.0 1.0 1.0   1.0 1.0 1.0 1.0 1.0 1.0

Uneven sentences from the bitext.

  • 'i': insert sentence
Why wait for the Euro ?   Pourquoi attendre l' Euro ? Il existe un précédant .   -1.0 -1.0 -1.0 -1.0 -1.0 -1.0   -1.0 -1.0 -1.0 -1.0 -1.0 1.0 1.0 1.0 1.0 1.0

The sentence 'Il existe un précédant .' has been inserted at the end of the original target sentence.

  • 'd': delete sequence
Why wait for the Euro ?   l' Euro ?   1.0 1.0 1.0 -1.0 -1.0 -1.0   -1.0 -1.0 -1.0

The sequence 'Pourquoi attendre' has been deleted from the original target sentence.

(needs word alignments in the -data FILE)

  • 'r': replace sequence with equivalent part-of-speech
Where wait for the Euro ?  Pourquoi attendre l ' Euro ?  1.0 -1.0 -1.0 -1.0 -1.0 -1.0  1.0 -1.0 -1.0 -1.0 -1.0 -1.0

The sequence 'Why' in the original source sentence has been replaced by 'Where', which has the same POS tag.

(needs word alignments and source POS-tags in -data FILE and equivalent sequences in -replace FILE)
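To make the labelling convention explicit, the sketch below illustrates two of these modes (-1.0 marks words of a genuinely parallel pair, +1.0 marks divergent words, as in the examples above); it is a simplified illustration, not the actual build_data.py implementation:

# Simplified illustration of the 'p' and 'u' example generation modes.
import random

def make_parallel(src, tgt):
    # 'p' mode: a genuine sentence pair, every word labelled -1.0.
    return src, tgt, [-1.0] * len(src), [-1.0] * len(tgt)

def make_uneven(src, all_targets):
    # 'u' mode: pair the source with a random, unrelated target sentence;
    # every word is labelled +1.0 (divergent).
    tgt = random.choice(all_targets)
    return src, tgt, [1.0] * len(src), [1.0] * len(tgt)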

Learning

python -u src/similarity.py
*  -mdir          FILE : directory to save/restore models
   -seq_size       INT : sentences larger than this number of src/tgt words are filtered out [50]
   -batch_size     INT : number of examples per batch [32]
   -seed           INT : seed for randomness [1234]
   -debug              : debug mode
 [LEARNING OPTIONS]
*  -trn           FILE : training data
   -dev           FILE : validation data
   -src_tok       FILE : if provided, json tokenization options for onmt tokenization, points to vocabulary file
   -src_voc       FILE : vocabulary of src words (needed to initialize learning)
   -tgt_tok       FILE : if provided, json tokenization options for onmt tokenization, points to vocabulary file
   -tgt_voc       FILE : vocabulary of tgt words (needed to initialize learning)
   -src_emb       FILE : embeddings of src words (needed to initialize learning)
   -tgt_emb       FILE : embeddings of tgt words (needed to initialize learning)
   -src_emb_size   INT : size of src embeddings if -src_emb not used
   -tgt_emb_size   INT : size of tgt embeddings if -tgt_emb not used
   -src_lstm_size  INT : hidden units for src bi-lstm [256]
   -tgt_lstm_size  INT : hidden units for tgt bi-lstm [256]
   -lr           FLOAT : initial learning rate [1.0]
   -lr_decay     FLOAT : learning rate decay [0.9]
   -lr_method   STRING : GD method either: adam, adagrad, adadelta, sgd, rmsprop [adagrad]
   -aggr          TYPE : aggregation operation: sum, max, lse [lse]
   -r            FLOAT : r for lse [1.0]
   -dropout      FLOAT : dropout ratio [0.3]
   -mode        STRING : mode (alignment, sentence) [alignment]
   -max_sents      INT : Consider this number of sentences per batch (0 for all) [0]
   -n_epochs       INT : train for this number of epochs [1]
   -report_every   INT : report every this many batches [1000]

+ Options marked with * must be set. The rest have default values.
+ If -mdir exists in learning mode, learning continues after restoring the last model
+ Training data is shuffled at every epoch
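A possible training invocation, assuming the files prepared in the previous steps (all file names are placeholders):

python -u src/similarity.py -mdir model \
    -trn train.examples -dev dev.examples \
    -src_voc vocab.en -tgt_voc vocab.fr \
    -src_emb emb.en -tgt_emb emb.fr \
    -n_epochs 5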

Inference

python -u src/similarity.py
*  -mdir          FILE : directory to save/restore models
   -batch_size     INT : number of examples per batch [32]
   -seed           INT : seed for randomness [1234]
   -debug              : debug mode
 [INFERENCE OPTIONS]
   -epoch          INT : epoch to use ([mdir]/epoch[epoch] must exist, by default the latest one in mdir)
*  -tst           FILE : testing data
   -output        FILE : output file [- by default is STDOUT]
   -q                  : quiet mode, just output similarity score
   -show_matrix        : output formatted alignment matrix (mode must be alignment)
   -show_svg           : output alignment matrix using svg-like html format (mode must be alignment)
   -show_align         : output source/target alignment matrix (mode must be alignment)
   -show_last          : output source/target last vectors
   -show_aggr          : output source/target aggr vectors

+ Options marked with * must be set. The rest have default values.
+ -show_last, -show_aggr and -show_align can be used at the same time
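A possible inference invocation over a test file, again with placeholder file names:

python -u src/similarity.py -mdir model -tst test.examples -show_align -output test.sim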

If the files tokenization_src.json or tokenization_tgt.json are found in the model directory, the corresponding OpenNMT tokenization and sub-tokenization are performed on the fly. For instance:

{
   "mode": "aggressive",
   "vocabulary": "vocab.en"
}

Fixing sentence pairs

python -u src/fix.py [-tau INT] [-nbest INT] [-max_sim FLOAT] [-use_punct] < FILE_WITH_ALIGNMENTS
