punkProse

Punctuation generation for speech transcripts using lexical, syntactic and prosodic features.

This is a modified version of the forked repository: training is reduced to a single stage and more word-level prosodic features are added. This version lets you train with any combination of word-aligned features.

Prosodically annotated files are in proscript format (https://github.com/alpoktem/proscript). For example data and extraction scripts, see: https://github.com/alpoktem/ted_preprocess
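
For orientation only: a proscript file is a word-aligned table in which each row holds one token together with its lexical and prosodic features. The exact columns and file layout are defined by the proscript library linked above; the snippet below is purely illustrative, uses only the feature names that appear later in this README (word, pos, pause_before, f0_mean), and contains made-up values.

word	pos	pause_before	f0_mean
so	ADV	0.00	180.2
today	NOUN	0.15	195.6
we	PRON	0.00	170.4
...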

How does it perform?

The English punctuation model was trained on a prosodically annotated TED corpus consisting of 1038 talks (155174 sentences). Link to dataset: http://hdl.handle.net/10230/33981

Punctuation generation accuracy with respect to human transcription:

PUNCTUATION          PRECISION   RECALL   F-SCORE
Comma (,)                 61.3     48.9      54.4
Question Mark (?)         71.8     70.6      71.2
Period (.)                82.6     83.5      83.0
Overall                   73.7     67.3      70.3

These scores were obtained with a model trained on leveled pause duration and mean f0 features together with words and POS tags.

Example Run

  • Requirements:
    • Python 3.x
    • NumPy
    • Theano
    • PyYAML (provides the yaml module)

The data directory (path $datadir) should look like the output folder (data/corpus) in https://github.com/alpoktem/ted_preprocess. Vocabularies and the sampled training/testing/development sets are stored here.

The sample run explained here is also provided in run.sh.
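
As a rough sketch of what $datadir is expected to contain (the authoritative layout is whatever ted_preprocess writes to data/corpus; only train_samples is named explicitly in this README, the other names below are assumptions):

$datadir/
    vocabulary files     # word/POS vocabularies produced by ted_preprocess
    train_samples/       # sequenced training samples read by main.py
    dev_samples/         # development samples (directory name assumed)
    test_samples/        # proscript files punctuated with punctuator.py (name assumed; passed as $test_samples below)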

Training

Training is done on sequenced data stored in train_samples under $datadir.

The dataset features to train with are given with the -f flag, one flag per feature. Other training parameters are specified through the parameters.yaml file. To train with word, pause, POS and mean f0 features:

modelId="mod_word-pause-pos-mf0"

python main.py -m $modelId -f word -f pause_before -f pos -f f0_mean -p parameters.yaml
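
Since any combination of word-aligned features can be used, the same command with a different -f list trains a different model. For example, a lexical-only model using just words and POS tags (the model id below is only an illustrative name):

modelId="mod_word-pos"

python main.py -m $modelId -f word -f pos -p parameters.yaml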

Testing

Testing is done on proscript data using punctuator.py. Either a single <input-file> or an <input-directory> is given as input, using -i or -d respectively. Any punctuation information already present in this data is ignored. Predictions for each file in the $test_samples directory are written to the $out_predictions directory. Input files must contain the features that the model was trained with.

model_name="Model_single-stage_""$modelId""_h100_lr0.05.pcl"

python punctuator.py -m $model_name -d $test_samples -o $out_predictions
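
To punctuate a single proscript file instead of a whole directory, -i is used in place of -d. The input path below is a placeholder, and -o is assumed to point at the output location exactly as in the directory case:

python punctuator.py -m $model_name -i $input_file -o $out_predictions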

Scoring the testing output:

Predictions are compared with ground-truth data using error_calculator.py. It takes either two files to compare, or two directories containing ground-truth and prediction files. Use -r to reduce the set of punctuation marks before scoring.

python error_calculator.py -g $groundtruthData -p $out_predictions -r
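
error_calculator.py also accepts a single ground-truth file and a single prediction file. The paths below are placeholders, and passing files through the same -g/-p flags is an assumption based on the description above; dropping -r scores the full punctuation set instead of the reduced one:

python error_calculator.py -g $groundtruth_file -p $prediction_file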

Citing

More details can be found in the publication: https://link.springer.com/chapter/10.1007/978-3-319-68456-7_11

This work can be cited as:

@inproceedings{punkProse,
	author = {Alp Oktem and Mireia Farrus and Leo Wanner},
	title = {Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech},
	booktitle = {5th International Conference on Statistical Language and Speech Processing (SLSP 2017)},
	year = {2017},
	address = {Le Mans, France}
}
