Morfessor 2.0 - Quick start
===========================


Installation
------------

Morfessor 2.0 is installed using the setuptools library for Python. To
build and install the module and scripts to the default paths, type ::

    python setup.py install

For details, see http://docs.python.org/install/
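
For a setuptools-based project such as this one, installing from a source
checkout with pip should be equivalent (an assumption; this has not been
verified against this branch) ::

    pip install .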


Documentation
-------------

User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions on how
to build the documentation can be found in docs/README.

The documentation is also available online at http://morfessor.readthedocs.org/

Morfessor EM+Prune
------------------

This branch includes the modifications to Morfessor that enable
training with Expectation Maximization (EM) and pruning.

Morfessor EM+Prune training achieves a better Morfessor cost than the
earlier local search training algorithm.

A simple usage example ::

    # Create a 1M-substring seed lexicon directly from a pretokenized corpus
    freq_substr.py --lex-size 1000000 < corpus > freq_substr.1M
    
    # Perform Morfessor EM+Prune training, autotuning to a 10k lexicon size
    morfessor \
        --em-prune freq_substr.1M \
        -t corpus \
        --num-morph-types 10000 \
        --save-segmentation emprune.model
    
    # Segment data using the Viterbi algorithm
    morfessor-segment \
        testdata \
        --em-prune emprune.model \
        --output segmented.testdata

Additional options for freq_substr.py ::

    --traindata-list
        Training data is a list of word types preceded by counts, not a corpus.

    --prune-redundant "-1"
        Setting prune-redundant to -1 disables pre-pruning of redundant substrings.
        Note the quotes, which prevent the dash from being interpreted as a flag.

    --forcesplit-before XYZ
        Force a splitting point before the characters X, Y and Z
    --forcesplit-after XYZ
        Force a splitting point after the characters X, Y and Z
    --forcesplit-both XYZ
        Force a splitting point both before and after the characters X, Y and Z
        Note that hyphens are NOT force-split by default anymore;
        to get the same force-splitting as Morfessor Baseline,
        you need to specify --forcesplit-both "-"
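
For example, these options can be combined when building a seed lexicon from
a word count list (a sketch: counts.txt is a hypothetical file of count-word
pairs, and the option values are illustrative) ::

    freq_substr.py \
        --lex-size 1000000 \
        --traindata-list \
        --forcesplit-both "-" \
        < counts.txt > freq_substr.1M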

Additional options for EM+Prune training ::

    --traindata-list
        Training data is a list of word types preceded by counts, not a corpus.

    --prune-criterion {mdl,autotune,lexicon}
        mdl: (alpha-weighted) Minimum Description Length pruning.
        autotune: MDL with automatic tuning of alpha to reach a target lexicon size.
                  If you want a fixed lexicon size, use this.
                  Use --num-morph-types to specify the size of the lexicon.
        lexicon: prune towards a lexicon size, with the prior omitted or alpha pretuned.
                 You probably want "autotune" instead.

    --num-morph-types N
        Goal lexicon size.

    --prune-proportion 0.2
        How large a proportion of the lexicon to prune in each epoch.

    --em-subepochs 3
        How many sub-epochs of EM to perform.

    --expected-freq-threshold 0.5
        Also prune subwords with expected count less than this.

    --lateen {none,full,prune}
        Lateen EM training mode.
        none: "soft" EM (default)
        full: Lateen-EM
        prune: EM+Viterbi-prune

    --no-bayesianify
        Leave out the Bayesian EM exp-digamma transformation of expected counts.

    --no-lexicon-cost
        Omit prior entirely.

    --freq-distr-cost {baseline,omit}
        Frequency distribution prior to use.
        baseline: Approximate Morfessor Baseline prior (default).
        omit: set frequency distribution cost to zero.

    --save-pseudomodel
        Use the trained EM+Prune model to segment the training data,
        and save the resulting segmentation as if it were a Morfessor Baseline model.

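A training run with explicit MDL pruning could combine these options as
follows (a sketch: the option values repeat the examples listed above and
are illustrative, not recommendations) ::

    morfessor \
        --em-prune freq_substr.1M \
        -t corpus \
        --prune-criterion mdl \
        --prune-proportion 0.2 \
        --em-subepochs 3 \
        --expected-freq-threshold 0.5 \
        --save-segmentation emprune.mdl.model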

Additional options for segmentation ::

    --sample-nbest
        Sample alternative segmentations from the n-best list.
        Approximates --sample, but is much faster.

    --sample
        Sample from full distribution. You probably want --sample-nbest instead.

    --sampling-temperature 0.5
        (Inverted) temperature parameter for sampling. (1.0 = unsmoothed)
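
For example, sampling segmentations from the n-best list could look like this
(a sketch: the flags are used exactly as listed above, and sampled.testdata
is a hypothetical output file) ::

    morfessor-segment \
        testdata \
        --em-prune emprune.model \
        --sample-nbest \
        --sampling-temperature 0.5 \
        --output sampled.testdata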

A note on pretokenization and boundary markers:

Morfessor EM+Prune is typically used with *word* boundary markers (marking
where the whitespace should go), rather than the *morph* boundary markers
(marking word-internal boundaries) used by previous Morfessor versions.
Make sure that the word boundary markers are present in the corpus or word
count lists used for Morfessor EM+Prune training, and also in the input to
Morfessor EM+Prune during segmentation. Some ways to achieve this are to
use the pyonmttok library with spacer_annotate=True and joiner_annotate=False,
or the dynamicdata dataloader with pretokenize=True. Both insert '▁'
(Unicode LOWER ONE EIGHTH BLOCK, U+2581) as the word boundary marker.
Also remember to adjust your detokenization post-processing script
accordingly.
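
As an illustration, a minimal pyonmttok sketch (the "aggressive" tokenization
mode is an arbitrary example choice; only spacer_annotate and joiner_annotate
are prescribed by the note above) ::

    import pyonmttok

    # Annotate word boundaries with spacers ('▁', U+2581) instead of
    # marking word-internal joins with joiner symbols.
    tokenizer = pyonmttok.Tokenizer(
        "aggressive",
        spacer_annotate=True,
        joiner_annotate=False,
    )

    # tokenize returns the token list and optional token features.
    tokens, _ = tokenizer.tokenize("a hyphen-ated example")
    print(" ".join(tokens))  # spacers mark where the whitespace was

    # Detokenization reverses the spacer annotation; apply the same
    # logic when post-processing segmented output.
    print(tokenizer.detokenize(tokens))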


Contact
-------

Questions or feedback? Email: morpho@aalto.fi


Citing
------

If you use the Morfessor EM+Prune training algorithm, please cite

@inproceedings{gronroos2020morfessor,
    title = {Morfessor {EM+Prune}: Improved Subword Segmentation with Expectation Maximization and Pruning},
    author = {Gr{\"o}nroos, Stig-Arne and Virpioja, Sami and Kurimo, Mikko},
    year = {2020},
    month = {may},
    address = {Marseille, France},
    booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
    publisher = {ELRA},
}

An arXiv preprint is available online at

https://arxiv.org/abs/2003.03131


For the original Morfessor 2.0: Python implementation, please cite

@techreport{virpioja2013morfessor,
    address = {Helsinki, Finland},
    type = {Report},
    title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
    language = {eng},
    number = {25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY},
    institution = {Department of Signal Processing and Acoustics, Aalto University},
    author = {Virpioja, Sami and Smit, Peter and Gr{\"o}nroos, Stig-Arne and Kurimo, Mikko},
    year = {2013},
    pages = {38}
}

The report is available online at 

http://urn.fi/URN:ISBN:978-952-60-5501-5