Grapheme-to-Phoneme models for Norwegian

Introduction

This repo contains Grapheme-to-Phoneme (G2P) models for Norwegian, to be used with the G2P engine Phonetisaurus. The G2P models can be used to generate pronunciation lexica from word lists. For more information on how to do that, consult the Phonetisaurus repo.
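As a rough illustration of generating a lexicon from a word list, the sketch below wraps a call to the Phonetisaurus command-line tool from Python. It assumes that Phonetisaurus is installed, that its `phonetisaurus-apply` script accepts `--model` and `--word_list`, and that `words.txt` is a hypothetical file with one word per line. The helper names `build_g2p_command` and `transcribe` are made up for this example. Check the invocation against the Phonetisaurus documentation before relying on it.

```python
import shutil
import subprocess


def build_g2p_command(model_path, word_list_path):
    """Build the command line for a Phonetisaurus G2P run.

    Assumption: phonetisaurus-apply takes --model and --word_list;
    verify this against the Phonetisaurus documentation.
    """
    return ["phonetisaurus-apply", "--model", model_path, "--word_list", word_list_path]


def transcribe(model_path, word_list_path):
    """Run Phonetisaurus and return its raw output lines, unparsed."""
    if shutil.which("phonetisaurus-apply") is None:
        raise RuntimeError("phonetisaurus-apply was not found on PATH")
    result = subprocess.run(
        build_g2p_command(model_path, word_list_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # words.txt is a hypothetical word list with one word per line.
    for line in transcribe("train/model-notone-nob.fst", "words.txt"):
        print(line)
```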

The models in this repo are trained on the Norwegian pronunciation lexicon for ASR, originally made by the defunct company Nordisk språkteknologi (NST) and currently distributed by the National Library of Norway.

Two models have been developed. One is trained on a full version of the lexicon, including phones, marking of primary and secondary stress, and tone. The other is trained on a simplified version where tonal markings and markings of secondary stress are removed.

Content

The folder train/ is too large to store on GitHub. It can be retrieved as a tarball from this address.

train/: Contains the models, as well as auxiliary files used by Phonetisaurus
- model-wtone-nob.fst contains full tone and stress specifications
- model-notone-nob.fst lacks tone and secondary stress
lexica/: Contains various lexica used for training and testing
- NST-total_train.dict is the training set for model-wtone-nob.fst. It contains 612 366 word-transcription pairs (WTP) and constitutes 90% of the unique WTPs in the NST lexicon.
- NST-total_test.dict is the test set for model-wtone-nob.fst. It consists of the remaining 10% of the unique WTPs in the NST lexicon, which have been randomly selected.
- NST-total-notone_nosecstress_train.dict is the training set for model-notone-nob.fst. It is equal to NST-total_train.dict, but markings of tone and secondary stress have been removed
- NST-total-notone_nosecstress_test.dict is the test set for model-notone-nob.fst. It is equal to NST-total_test.dict, but markings of tone and secondary stress have been removed
- NST-total_test_predicted.dict is the test set with tone and secondary stress, with transcriptions predicted by the G2P system
- NST-total_test_notone_predicted.dict is the test set without tone and secondary stress, with transcriptions predicted by the G2P system
g2p_stats.py is the evaluation script used in this project.
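The 90/10 division into training and test sets described above could be reproduced along the following lines. This is a minimal sketch, not the procedure actually used for this repo: the function name `split_lexicon`, the fixed seed, and the representation of each word-transcription pair as a `(word, transcription)` tuple are all assumptions.

```python
import random


def split_lexicon(pairs, test_fraction=0.1, seed=0):
    """Deduplicate word-transcription pairs and split them at random.

    pairs: iterable of (word, transcription) tuples.
    Returns (train, test) as two lists of unique pairs.
    """
    unique = sorted(set(pairs))  # unique WTPs in a deterministic order
    rng = random.Random(seed)    # the seed is an assumption, for reproducibility
    rng.shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return unique[n_test:], unique[:n_test]


if __name__ == "__main__":
    # Toy data for illustration only; duplicates are removed before splitting.
    toy = [(f"word{i}", "T") for i in range(20)] + [("word0", "T")]
    train, test = split_lexicon(toy)
    print(len(train), len(test))
```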

Transcription standard

Although the original NST lexicon uses X-SAMPA as its transcription standard, this project uses NoFAbet, an equivalent standard that is easier for humans to read. NoFAbet is based in part on 2-letter ARPAbet and was made by Nate Young for the National Library of Norway in connection with the development of NoFA, a forced aligner for Norwegian, soon to be released.

X-SAMPA-NoFAbet equivalence table

X-SAMPA	NoFAbet	Example
A:	AA0	bad
{:	AE0	vær
{	AEH0	vært
{*I	AEJ0	sei
E*u0	AEW0	sau
A	AH0	hatt
A*I	AJ0	kai
@	AX0	behage
b	B	bil
d	D	dag
e:	EE0	lek
E	EH0	penn
f	F	fin
g	G	gul
h	H	hes
I	IH0	sitt
i:	II0	vin
j	J	ja
k	K	kost
C	KJ	kino
l	L	land
l=	LX0
m	M	man
m=	MX0
n	N	nord
N	NG	eng
n=	NX0
o:	OA0	rå
O	OAH0	gått
2:	OE0	løk
9	OEH0	høst
9*Y	OEJ0	køye
U	OH0	fort
O*Y	OJ0	konvoy
u:	OO0	bod
@U	OU0	show
p	P	pil
r	R	rose
d`	RD	rekord
l`	RL	perle
l`=	RLX0
n`	RN	barn
n`=	RNX0
s`	SJ	pers
t`	RT	stort
r=	RX0
s	S	sil
S	SJ	sju
s=	SX0
t	T	tid
u0	UH0	russ
u0 j	UH0_J	Anhui
}:	UU0	hus
v	V	vase
w	W	Washington
Y	YH0	nytt
y:	YY0	ny

Every syllable nucleus, whether a vowel or a syllabic consonant, is followed by a digit. Unstressed syllables are marked with 0. Syllables with primary stress are marked with 1 for tone 1 and 2 for tone 2. Secondary stress is marked with 3. In the material without tone and secondary stress marking, all 3s are replaced by 0s and all 2s by 1s.
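The digit replacement used for the simplified material can be sketched as a plain character substitution, since digits occur in NoFAbet only as stress and tone markers. The function name and the example transcription below are hypothetical and serve only as illustration.

```python
# Secondary stress (3) becomes unstressed (0); tone 2 (2) becomes tone 1 (1).
# The digits 0 and 1 are left unchanged.
_SIMPLIFY = str.maketrans({"3": "0", "2": "1"})


def strip_tone_and_secondary_stress(transcription):
    """Return a NoFAbet transcription without tone 2 and secondary stress marks."""
    return transcription.translate(_SIMPLIFY)


if __name__ == "__main__":
    # Hypothetical transcription, not taken from the lexicon.
    print(strip_tone_and_secondary_stress("S AA2 L AX0 D AH3 G"))
```

Note that this only works on NoFAbet transcriptions. X-SAMPA uses digits as phone symbols (for example 2: and 9), so the same substitution would corrupt X-SAMPA input.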

For compatibility with NoFA, retroflex s is rendered as SJ instead of RS, which means that there is no distinction between postalveolar and retroflex s in the transcriptions.

Evaluation

Model	Word Error Rate (%)	Phoneme Error Rate (%)
model-wtone-nob.fst	14.29	2.76
model-notone-nob.fst	10.44	2.00

The PER calculation is borrowed from this tutorial.
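For orientation, one common way to compute the two metrics is sketched below. WER is taken as the share of words whose predicted transcription differs from the reference. PER is taken as the total Levenshtein distance between predicted and reference phone sequences, divided by the total number of reference phones. This is an assumption about the definitions, and g2p_stats.py may differ in detail. The function names and example transcriptions are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(pairs):
    """Compute (WER, PER) in percent.

    pairs: iterable of (reference, predicted), each a list of phone symbols.
    """
    n_words = word_errors = phone_errors = ref_phones = 0
    for ref, hyp in pairs:
        n_words += 1
        word_errors += ref != hyp  # a word counts as wrong if any phone differs
        phone_errors += edit_distance(ref, hyp)
        ref_phones += len(ref)
    return 100 * word_errors / n_words, 100 * phone_errors / ref_phones


if __name__ == "__main__":
    # Illustrative transcriptions, not taken from the test set.
    pairs = [
        ("B II1 L".split(), "B II1 L".split()),
        ("D AA1 G".split(), "D AH1 G".split()),
    ]
    wer, per = error_rates(pairs)
    print(f"WER {wer:.2f}  PER {per:.2f}")
```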

Usage

The models created in this project can be used for any purpose, as long as it is compliant with Phonetisaurus' license.
