FastText Subword Size Optimizer

Suggests subword sizes for fastText language models using character n-gram frequency analysis.

Suggesting Subword Sizes

To suggest subword sizes for one or more languages, use the suggest_subword_sizes.sh tool:

$ git clone https://github.com/MIR-MU/fasttext-optimizer.git --recurse-submodules
$ cd fasttext-optimizer
$ TMPDIR=/var/tmp ./suggest_subword_sizes.sh en de cs it

Suggested subword sizes for en: -minn 1 -maxn 5 (4.52% n-gram coverage)
Suggested subword sizes for de: -minn 6 -maxn 6 (4.19% n-gram coverage)
Suggested subword sizes for cs: -minn 1 -maxn 4 (3.28% n-gram coverage)
Suggested subword sizes for it: -minn 6 -maxn 6 (5.11% n-gram coverage)
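The suggested `-minn` and `-maxn` values are ordinary fastText command-line options, so they can be passed on to a training command. The following is a minimal sketch, assuming the output format shown above; `extract_subword_flags` is a hypothetical helper that is not part of this repository, and `corpus.en.txt` and `model.en` are placeholder paths. The command is printed rather than executed.

```shell
# Hypothetical helper: pull "-minn N -maxn M" out of one line of the tool's output.
# Prints nothing if the line does not match the format shown above.
extract_subword_flags() {
  printf '%s\n' "$1" |
    sed -n 's/^Suggested subword sizes for [^:]*: \(-minn [0-9][0-9]* -maxn [0-9][0-9]*\) .*$/\1/p'
}

line='Suggested subword sizes for en: -minn 1 -maxn 5 (4.52% n-gram coverage)'
flags=$(extract_subword_flags "$line")

# Print (do not run) a fastText training command that uses the suggested sizes.
# corpus.en.txt and model.en are placeholder paths.
echo "fasttext skipgram -input corpus.en.txt -output model.en $flags"
```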

To suggest subword sizes from Python instead, see our Python tutorial.

Training FastText Models

To train one or more fastText models with the suggested subword sizes, use the train_fasttext_models.sh tool:

$ git clone https://github.com/MIR-MU/fasttext-optimizer.git --recurse-submodules
$ cd fasttext-optimizer
$ TMPDIR=/var/tmp ./train_fasttext_models.sh cs de es fr
$ ls data/wikimedia/wiki.{cs,de,es,fr}.{default,suggested}.{bin,vec}

data/wikimedia/wiki.cs.default.bin    data/wikimedia/wiki.cs.default.vec
data/wikimedia/wiki.cs.suggested.bin  data/wikimedia/wiki.cs.suggested.vec
data/wikimedia/wiki.de.default.bin    data/wikimedia/wiki.de.default.vec
data/wikimedia/wiki.de.suggested.bin  data/wikimedia/wiki.de.suggested.vec
data/wikimedia/wiki.es.default.bin    data/wikimedia/wiki.es.default.vec
data/wikimedia/wiki.es.suggested.bin  data/wikimedia/wiki.es.suggested.vec
data/wikimedia/wiki.fr.default.bin    data/wikimedia/wiki.fr.default.vec
data/wikimedia/wiki.fr.suggested.bin  data/wikimedia/wiki.fr.suggested.vec
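The `ls` command above relies on brace expansion, which is not available in every shell. As a sketch under the assumption that the output files follow the naming pattern shown in the listing, the hypothetical helper `expected_model_files` below enumerates the same paths with plain loops and reports any that are missing.

```shell
# Hypothetical helper: list the model files expected for the given language codes,
# assuming the data/wikimedia/wiki.LANG.VARIANT.EXT pattern from the listing above.
expected_model_files() {
  for lang in "$@"; do
    for variant in default suggested; do
      for ext in bin vec; do
        printf 'data/wikimedia/wiki.%s.%s.%s\n' "$lang" "$variant" "$ext"
      done
    done
  done
}

# Report every expected file that does not exist yet.
expected_model_files cs de es fr | while read -r path; do
  [ -e "$path" ] || echo "missing: $path"
done
```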

Correlating Language Distances

To use suggested subword sizes as a measure of distance between languages, and to see how this measure correlates with other language distance measures, see our Python tutorial.
