GitHub - rpryzant/proxy-a-distance: Proxy A-Distance algorithm for measuring domain disparity in parallel corpora

Proxy A-Distance

This is an implementation of an algorithm discussed in Ganin et al. (2015), Glorot et al. (2011), and Ben-David et al. (2007). It has been adapted for use with machine translation datasets and released to the public under the MIT license.

This algorithm computes the Proxy A-Distance (PAD) between two domain distributions. PAD measures how dissimilar datasets from different domains are (e.g. newspapers and talk shows), based on the error of a classifier trained to tell the two domains apart. Intuitively, similar domains => bigger error => smaller PAD. Dissimilar domains => smaller error => bigger PAD. The error is the mean absolute error (MAE) of the binary domain classifier; for a classifier that does no worse than chance (error between 0 and 0.5), PAD falls in the range [0, 2].
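As a small worked example of the relationship between error and PAD, the sketch below applies the formula PAD = 2 (1 − 2e) from the algorithm description to a few error rates. The function name `proxy_a_distance` is illustrative and is not taken from this repository.

```python
def proxy_a_distance(error: float) -> float:
    """Map a domain classifier's test error e to PAD = 2 * (1 - 2 * e)."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must be a rate in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)


# Chance-level classifier: the domains are indistinguishable.
print(proxy_a_distance(0.5))   # 0.0
# Error halfway between chance and perfect.
print(proxy_a_distance(0.25))  # 1.0
# Perfect classifier: the domains are fully separable.
print(proxy_a_distance(0.0))   # 2.0
```

An error above 0.5 would give a negative value, which is why the [0, 2] range assumes a classifier that is at least as good as chance.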

The algorithm is as follows:

1. Mix the two datasets. Apply labels that indicate each example's origin.
2. Train a classifier on the merged data.
3. Measure the classifier's error e on a held-out test set.
4. Set PAD = 2 (1 − 2e)

We use a linear bag-of-words SVM for the underlying classifier.
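The steps above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code in `main.py`: the helper name `estimate_pad`, the 80/20 split, the default `CountVectorizer` tokenization, and the toy sentences are all made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def estimate_pad(corpus_a, corpus_b, test_size=0.2, seed=0):
    """Estimate PAD between two lists of sentences (hypothetical helper)."""
    # Step 1: mix the datasets and label each example by its origin.
    sentences = list(corpus_a) + list(corpus_b)
    labels = np.array([0] * len(corpus_a) + [1] * len(corpus_b))

    # Bag-of-words counts over the merged data (assumed tokenization).
    features = CountVectorizer().fit_transform(sentences)

    # Hold out a stratified test split so both domains appear in it.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, stratify=labels, random_state=seed
    )

    # Step 2: train a linear SVM to predict the domain label.
    clf = LinearSVC().fit(x_train, y_train)

    # Step 3: test error. For 0/1 labels the mean absolute error
    # equals the misclassification rate.
    error = float(np.mean(np.abs(clf.predict(x_test) - y_test)))

    # Step 4: convert the error to PAD.
    return 2.0 * (1.0 - 2.0 * error)


# Toy data with disjoint vocabularies, so the domains are easy to separate.
news = [f"the parliament adopted resolution number {i}" for i in range(50)]
subtitles = [f"hey come on let us go already {i}" for i in range(50)]
pad = estimate_pad(news, subtitles)
print(pad)
```

On data this easy to separate the estimate should sit near the top of the range; on real corpora the value depends on the split, the features, and the classifier settings.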

Requirements

numpy: pip install numpy
scikit-learn: pip install scikit-learn

Usage

python main.py [corpusfile 1] [corpusfile 2] [vocab file]

corpusfile 1 is a text file with one sentence per line.
corpusfile 2 is another text file with one sentence per line.
vocab file is a text file with one token per line. These tokens form the shared vocabulary for the two corpus files above.
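To make the input format concrete, here is a minimal sketch of reading such files and turning a sentence into bag-of-words counts over the shared vocabulary. The helper names `read_lines` and `bag_of_words` are hypothetical, and whitespace tokenization is an assumption; the repository may process its inputs differently.

```python
def read_lines(path):
    """Return the non-empty, stripped lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def bag_of_words(sentence, vocab_index):
    """Count whitespace-separated tokens of `sentence` found in the vocabulary."""
    counts = [0] * len(vocab_index)
    for token in sentence.split():
        idx = vocab_index.get(token)
        if idx is not None:  # out-of-vocabulary tokens are skipped
            counts[idx] += 1
    return counts


# In practice the vocabulary would come from read_lines("<vocab file>");
# an in-memory list is used here so the example runs without any files.
vocab = ["the", "cat", "sat"]
vocab_index = {token: i for i, token in enumerate(vocab)}
print(bag_of_words("the cat saw the dog", vocab_index))  # [2, 1, 0]
```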

Example

python main.py test_data/europarl.en test_data/europarl.fr test_data/opensubtitles.en test_data/opensubtitles.fr test_data/vocab
