phonotactic-complexity

This repository contains code for analysing phonotactics. It accompanies the paper: Phonotactic Complexity and Its Trade-offs (Pimentel et al., TACL 2020).

It is a study of languages' phonotactics and how phonotactic complexity relates to other language features, such as word length.

Install Dependencies

Create a conda environment with

$ source config/conda.sh

Then install the version of PyTorch appropriate for your system.
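For example, to match the version this code was tested with (listed under Dependencies below), a plain pip install inside the environment should work; adjust this for your CUDA setup:

$ pip install torch==1.0.1.post2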

Parse data

First, download the NorthEuraLex data from this link and place it in the datasets/northeuralex folder. Then parse it with the following command:

$ python data_layer/parse.py --data northeuralex

Train models

Train base models

You can train the base models (without shared embeddings) with the commands:

$ python learn_layer/train_base.py --model <model> [--opt]
$ python learn_layer/train_base_bayes.py --model <model>
$ python learn_layer/train_base_cv.py --model <model> [--opt]

The scripts differ as follows:

  • train_base: Trains a model with the default data split;
  • train_base_bayes: Trains a model using Bayesian optimization and the default data split;
  • train_base_cv: Trains cross-validated models.

Here, <model> can be one of:

  • lstm: LSTM with default one-hot embeddings
  • phoible: LSTM with PHOIBLE feature embeddings
  • phoible-lookup: LSTM with both one-hot and PHOIBLE embeddings

The optional --opt flag tells the script to use the Bayesian-optimized hyper-parameters; it can only be used after training the model with train_base_bayes.
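
For example, to train a base LSTM with Bayesian-optimized hyper-parameters, first run the bayes script and then re-train with --opt:

$ python learn_layer/train_base_bayes.py --model lstm
$ python learn_layer/train_base.py --model lstm --opt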

Train shared models

You can train models with shared embeddings using the commands:

$ python learn_layer/train_shared.py --model <model> [--opt]
$ python learn_layer/train_shared_bayes.py --model <model>
$ python learn_layer/train_shared_cv.py --model <model> [--opt]

Here, <model> can be one of:

  • shared-lstm: LSTM with shared one-hot embeddings
  • shared-phoible: LSTM with shared PHOIBLE embeddings
  • shared-phoible-lookup: LSTM with both one-hot and shared PHOIBLE embeddings
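
For example, to train the shared one-hot LSTM and its cross-validated counterpart:

$ python learn_layer/train_shared.py --model shared-lstm
$ python learn_layer/train_shared_cv.py --model shared-lstm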

Train ngram models

You can train ngram models with the following commands:

$ python learn_layer/train_ngram.py --model ngram
$ python learn_layer/train_unigram.py --model unigram

$ python learn_layer/train_ngram_cv.py --model ngram
$ python learn_layer/train_unigram_cv.py --model unigram

Here, <model> can be one of:

  • ngram: n-gram model (a trigram by default)
  • unigram: Unigram model

Train models on artificial data

You can train models on artificial data using the commands:

$ python learn_layer/train_artificial.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_bayes.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_cv.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_ngram.py --model ngram --artificial-type <artificial-type>

Here, <artificial-type> can be one of:

  • harmony: Artificial data with vowel harmony removed;
  • devoicing: Artificial data with final obstruent devoicing removed.
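
For example, to train both a neural model and an ngram model on the data with vowel harmony removed:

$ python learn_layer/train_artificial.py --artificial-type harmony
$ python learn_layer/train_artificial_ngram.py --model ngram --artificial-type harmony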

Train all models

You can also call a script to train all models sequentially (it might take a while):

$ source learn_layer/train_multi.sh

Plot Results

First, compile the result data:

$ python analysis_layer/compile_results.py
$ python analysis_layer/get_lang_inventory.py

Then plot all results with the following commands:

$ mkdir plot
$ python visualization_layer/plot_lstm.py
$ python visualization_layer/plot_full.py
$ python visualization_layer/plot_inventory.py
$ python visualization_layer/plot_kde.py
$ python visualization_layer/plot_artificial_scatter.py

Extra Information

Citation

If this code or the paper were useful to you, consider citing it:

@article{pimentel-etal-2020-phonotactics,
    title={Phonotactic Complexity and its Trade-offs},
    author={Pimentel, Tiago and Roark, Brian and Cotterell, Ryan},
    journal={Transactions of the Association for Computational Linguistics},
    volume={8},
    pages={1--18},
    year={2020},
    publisher={MIT Press},
    doi={10.1162/tacl\_a\_00296},
    url={https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00296}
}

Dependencies

This project was tested with the following library versions:

numpy==1.15.4
pandas==0.24.1
scikit-learn==0.20.2
tqdm==4.31.1
matplotlib==2.0.2
seaborn==0.9.0
torch==1.0.1.post2
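
If you need to recreate this setup by hand (config/conda.sh is the intended route), the pinned versions can be installed with pip, for example:

$ pip install numpy==1.15.4 pandas==0.24.1 scikit-learn==0.20.2 tqdm==4.31.1 matplotlib==2.0.2 seaborn==0.9.0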

Contact

To ask questions or report problems, please open an issue.
