StatLing

StatLing implements a statistical approach to cognate detection and tree inference. Instead of an absolute cognacy judgement, it yields a posterior probability of cognacy. Both the model parameters and the posterior cognate judgements are estimated within a Markov chain Monte Carlo (MCMC) framework, using a combination of random-walk and slice-sampling moves.
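For readers unfamiliar with random-walk moves, the following is a minimal, generic sketch of a random-walk Metropolis chain with thinning. It is not StatLing's sampler: the one-dimensional standard normal target, the function names, and the rule of keeping every `thinning`-th state are all assumptions made for illustration.

```python
import math
import random


def log_target(x):
    """Log-density (up to a constant) of a standard normal; a stand-in target."""
    return -0.5 * x * x


def random_walk_metropolis(iterations, thinning, width, seed):
    """Run a 1-D random-walk Metropolis chain, keeping every `thinning`-th state."""
    rng = random.Random(seed)
    x = 0.0
    kept = []
    for i in range(1, iterations + 1):
        # Symmetric uniform proposal centred on the current state.
        proposal = x + rng.uniform(-width, width)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if i % thinning == 0:
            kept.append(x)
    return kept


samples = random_walk_metropolis(iterations=100, thinning=10, width=1.0, seed=42)
print(len(samples))
```

A posterior quantity is then approximated by averaging over the retained states, which is the sense in which a posterior probability replaces a single yes/no judgement.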

Dependencies

Python 2.7
numpy
pandas
dendropy
scipy
cython
igraph (python-igraph)
mpi4py

Installation

Before the module can be used, some C code has to be compiled.

Navigate to src\C_Extensions and compile the C code:

gcc -c fw_log.c
ar -rcs libfw_log.a fw_log.o

To build the remaining C functionality, run

python setup.py build_ext --inplace

This builds the C extensions in place, so they are available locally to this module.

Optimizations can be enabled in the setup file (setup.py).

Usage

Parameters for the MCMC estimation can be set in three places:

A settings file, which is passed to the main file with the -s flag.
The folder src\SamplerSettings\ contains two files: Prior.py holds information about the prior distribution and the sampling widths, and Constraint_Tree.py holds information about tree constraints. Refer to the two files for details on their usage.
The file ltrees.py in the folder src\Utils contains a tree in Newick format, stored as a tuple. This tree is used as a seed.

Flags in the settings file

The settings file consists of several flags which set some options for the MCMC engine.

[Seed] - Specify the seed for the random number generator
[Iterations] - Specify the number of iterations the MCMC runs
[Thinning] - Thinning parameter for the MCMC
[Data] - Path to the data file
[Output] - Path where the output is stored
[ldn] - Exclude word pairs whose normalized Levenshtein distance is below this threshold
[MC3] - Whether the MC3 parameter estimation scheme should be used (requires MPI)
[concept_col] - name of the column in the data file where the information about the concepts is stored
[lang_col] - name of the column where the language is stored
[transcription_col] - name of the column where the word is stored
[Sound Model] - string specifying the Sound Model in use, currently only supports "asjp"
[consistency_checker] - Whether the consistency of the tree with the predefined constraints should be checked
[Transition_Model] - Specifies the transition model for the alignment model. Currently only supports "KM03"
[bottom-up] - Kept for legacy reasons. Must be set to false. Will probably be removed.
[randomize] - Whether the starting point should be randomized
[randomize steps] - Number of randomization steps
[folder] - Specify a folder where information about an explicit starting point is stored.
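
To make the [ldn] flag more concrete, here is a small sketch of a normalized Levenshtein distance. Dividing the edit distance by the length of the longer word is a common convention and is assumed here; the normalization StatLing actually uses may differ, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (unit cost per insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / float(longest)


print(normalized_levenshtein("ai", "Ey"))
```

The resulting value lies between 0 (identical strings) and 1 (no symbols in common at matching positions), which is the scale on which the threshold would be chosen.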

This is an example of what the settings file should look like.

[Seed]
42

[Iterations]
100

...
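
The layout above (a bracketed flag name followed by its value on the next line) can be read with a few lines of Python. This is only a sketch of one way to parse such a file, assuming one value line per flag; it is not the parser StatLing itself uses, and `parse_settings` is a hypothetical name.

```python
def parse_settings(text):
    """Map each [Flag] header to the first non-empty line that follows it."""
    settings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            settings[current] = None
        elif current is not None and settings[current] is None:
            # Values are kept as strings; conversion is left to the caller.
            settings[current] = line
    return settings


example = "[Seed]\n42\n\n[Iterations]\n100\n"
print(parse_settings(example))
```

Keeping the values as strings avoids guessing the type of each flag; a caller would convert, for instance, the [Seed] value with `int(...)`.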

The Data File

The data file should be in .tsv format, i.e. tab separated. For example:

iso_code	gloss	transcription
ger	eye	ai
fra	eye	Ey
...	...	...

For this file, [concept_col] would be set to "gloss", [lang_col] to "iso_code" and [transcription_col] to "transcription" in the settings file.
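
Before starting a run, it can help to check that a data file really has the three columns named in the settings. The sketch below uses only the standard library rather than StatLing's own loader; `read_wordlist` is a hypothetical helper and the in-memory sample mirrors the example above.

```python
import csv
import io


def read_wordlist(handle, concept_col, lang_col, transcription_col):
    """Read a tab-separated word list and return (language, concept, word) tuples."""
    # QUOTE_NONE keeps quote characters that may occur in transcriptions.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    required = [concept_col, lang_col, transcription_col]
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    return [(row[lang_col], row[concept_col], row[transcription_col])
            for row in reader]


sample = "iso_code\tgloss\ttranscription\nger\teye\tai\nfra\teye\tEy\n"
rows = read_wordlist(io.StringIO(sample), "gloss", "iso_code", "transcription")
print(rows)
```

For a file on disk, an open file object can be passed in place of the `io.StringIO` sample.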
