phonotactic-complexity

This repository contains code for analysing phonotactics. It accompanies the paper: Phonotactic Complexity and Its Trade-offs (Pimentel et al., TACL 2020).

It is a study of languages' phonotactics and how phonotactic complexity relates to other language features, such as word length.

Install Dependencies

Create a conda environment with

$ source config/conda.sh

Then install the version of PyTorch appropriate for your system.
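For example, to match the version this code was tested with (listed under Dependencies below), a plain pip install inside the environment should work; adjust this for your CUDA setup:

$ pip install torch==1.0.1.post2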

Parse data

First, download the NorthEuraLex data from this link and place it in the datasets/northeuralex folder. Then parse it with the following command:

$ python data_layer/parse.py --data northeuralex

Train models

Train base models

You can train the base models (without shared embeddings) with the commands:

$ python learn_layer/train_base.py --model <model> [--opt]
$ python learn_layer/train_base_bayes.py --model <model>
$ python learn_layer/train_base_cv.py --model <model> [--opt]

The scripts differ as follows:

  • train_base: Trains a model with the default data split;
  • train_base_bayes: Trains a model using Bayesian optimization and the default data split;
  • train_base_cv: Trains cross-validated models.

Here, <model> can be one of:

  • lstm: LSTM with default one-hot embeddings
  • phoible: LSTM with PHOIBLE feature embeddings
  • phoible-lookup: LSTM with both one-hot and PHOIBLE embeddings

The optional --opt flag tells the script to use the Bayesian-optimized hyper-parameters; it can only be used after training the model with train_base_bayes.
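
For example, to train a base LSTM with Bayesian-optimized hyper-parameters, first run the bayes script and then re-train with --opt:

$ python learn_layer/train_base_bayes.py --model lstm
$ python learn_layer/train_base.py --model lstm --opt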

Train shared models

You can train models with shared embeddings using the commands:

$ python learn_layer/train_shared.py --model <model> [--opt]
$ python learn_layer/train_shared_bayes.py --model <model>
$ python learn_layer/train_shared_cv.py --model <model> [--opt]

Here, <model> can be one of:

  • shared-lstm: LSTM with shared one-hot embeddings
  • shared-phoible: LSTM with shared PHOIBLE embeddings
  • shared-phoible-lookup: LSTM with both one-hot and shared PHOIBLE embeddings
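
For example, to train the shared one-hot LSTM and its cross-validated counterpart:

$ python learn_layer/train_shared.py --model shared-lstm
$ python learn_layer/train_shared_cv.py --model shared-lstm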

Train ngram models

You can train ngram models with the following commands:

$ python learn_layer/train_ngram.py --model ngram
$ python learn_layer/train_unigram.py --model unigram

$ python learn_layer/train_ngram_cv.py --model ngram
$ python learn_layer/train_unigram_cv.py --model unigram

Here, <model> can be one of:

  • ngram: n-gram model (a trigram by default)
  • unigram: Unigram model

Train models on artificial data

You can train models on artificial data using the commands:

$ python learn_layer/train_artificial.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_bayes.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_cv.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_ngram.py --model ngram --artificial-type <artificial-type>

Here, <artificial-type> can be one of:

  • harmony: Artificial data with vowel harmony removed;
  • devoicing: Artificial data with final obstruent devoicing removed.
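
For example, to train both a neural model and an ngram model on the data with vowel harmony removed:

$ python learn_layer/train_artificial.py --artificial-type harmony
$ python learn_layer/train_artificial_ngram.py --model ngram --artificial-type harmony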

Train all models

You can also call a script to train all models sequentially (it might take a while):

$ source learn_layer/train_multi.sh

Plot Results

First, compile the result data:

$ python analysis_layer/compile_results.py
$ python analysis_layer/get_lang_inventory.py

Then plot all results with the following commands:

$ mkdir plot
$ python visualization_layer/plot_lstm.py
$ python visualization_layer/plot_full.py
$ python visualization_layer/plot_inventory.py
$ python visualization_layer/plot_kde.py
$ python visualization_layer/plot_artificial_scatter.py

Extra Information

Citation

If this code or the paper were useful to you, consider citing it:

@article{pimentel-etal-2020-phonotactics,
    title={Phonotactic Complexity and its Trade-offs},
    author={Pimentel, Tiago and Roark, Brian and Cotterell, Ryan},
    journal={Transactions of the Association for Computational Linguistics},
    volume={8},
    pages={1--18},
    year={2020},
    publisher={MIT Press},
    doi={10.1162/tacl\_a\_00296},
    url={https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00296}
}

Dependencies

This project was tested with the following library versions:

numpy==1.15.4
pandas==0.24.1
scikit-learn==0.20.2
tqdm==4.31.1
matplotlib==2.0.2
seaborn==0.9.0
torch==1.0.1.post2
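
If you need to recreate this setup by hand (config/conda.sh is the intended route), the pinned versions can be installed with pip, for example:

$ pip install numpy==1.15.4 pandas==0.24.1 scikit-learn==0.20.2 tqdm==4.31.1 matplotlib==2.0.2 seaborn==0.9.0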

Contact

To ask questions or report problems, please open an issue.
