Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation

Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee

Interspeech 2019

Abstract

Code-switching is about dealing with alternating languages in speech or text. It is partially speaker-dependent and domain-related, so completely explaining the phenomenon by linguistic rules is challenging. Compared to most monolingual tasks, insufficient data is an issue for code-switching. To mitigate the issue without expensive human annotation, we propose an unsupervised method for code-switching data augmentation. By utilizing a generative adversarial network, we can generate intra-sentential code-switching sentences from monolingual sentences. We applied the proposed method to two corpora, and the results show that the generated code-switching sentences improve the performance of code-switching language models.

Outline

Introduction
Methodology
Experimental setup
- Corpora
- Model Setup
Results
- Code-switching Point Prediction
- Generated Text Quality
- Language Modeling
- Examples
Conclusion

Corpora

LectureSS: Recordings of the “Signal and System” (SS) course taught by a Taiwanese instructor at National Taiwan University in 2006.
SEAME: South East Asia Mandarin-English, a conversational speech corpus recorded from Singaporean and Malaysian speakers, with an almost balanced gender distribution, at Nanyang Technological University and Universiti Sains Malaysia.

Experimental setup

Prerequisites

Python packages
- python 3
- keras 2
- numpy
- jieba
- h5py
- tqdm
Data
- text files
  - Training set
    1. corpus/XXX/text/train.mono.txt: Mono sentences in H
    2. corpus/XXX/text/train.cs.txt: CS sentences
  - Development set
    1. corpus/XXX/text/dev.mono.txt: Mono sentences in H translated from CS sentences (aligned to 2.)
    2. corpus/XXX/text/dev.cs.txt: CS sentences
  - Testing set
    1. corpus/XXX/text/test.mono.txt: Mono sentences in H
  - Note
    - Sentences should be segmented into words by spaces.
    - Word boundaries follow the H language.
    - If a word in the H language maps to a phrase in the G language, the words of that phrase are joined with dashes into a single token.
- local/XXX/translator.txt: Translation table from H language to G language
- local/XXX/dict.txt: Word list for training word-embedding
- local/postag.txt: POS tag list for training pos-embedding

Type	Example
CS	Causality 這個也是你所讀過的就是指我 output at-any-time 只 depend-on input
Mono from CS in H	因果性這個也是你所讀過的就是指我輸出在任意時間只取決於輸入

Note
- Mono: monolingual
- CS: code-switching
- H: host (language)
- G: guest (language)
- ASR: automatic speech recognition
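
The formatting rules above (space-segmented words, host-language word boundaries, dash-joined guest phrases) can be illustrated with a small sketch. The helper names `join_guest_phrase` and `switch_words`, the toy table, and the segmentation of the host sentence are all hypothetical; the actual layout of translator.txt is not described here, so the table is written as a plain Python dict.

```python
def join_guest_phrase(phrase: str) -> str:
    """Join a multi-word guest-language phrase into one dash-connected token."""
    return "-".join(phrase.split())


def switch_words(host_words, table, positions):
    """Replace host words at the given positions with guest-language tokens."""
    out = list(host_words)
    for i in positions:
        # Only switch words that have an entry in the translation table.
        if out[i] in table:
            out[i] = join_guest_phrase(table[out[i]])
    return out


# Toy table (hypothetical entries, modelled on the example row above).
toy_table = {
    "輸出": "output",
    "在任意時間": "at any time",
    "取決於": "depend on",
    "輸入": "input",
}

# Host sentence segmented by spaces (the segmentation itself is an assumption).
mono = "我 輸出 在任意時間 只 取決於 輸入".split()
cs = switch_words(mono, toy_table, positions=[1, 2, 4, 5])
print(" ".join(cs))
```

Because each guest phrase becomes a single token, the Mono and CS versions of a sentence keep the same number of words, which keeps the two files aligned word by word.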

Preprocess Data

Use Jieba to obtain part-of-speech (POS) tags for the text files; they are required by the proposed + POS setting.
- Path:
  - Training set
    1. corpus/XXX/pos/train.mono.txt: POS of Mono sentences of training set
    2. corpus/XXX/pos/train.cs.txt: POS of CS sentences of training set
  - Development set
    1. corpus/XXX/pos/dev.mono.txt: POS of Mono sentences of development set
  - Testing set
    1. corpus/XXX/pos/test.mono.txt: POS of Mono sentences of testing set
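
A minimal sketch of how such POS files could be produced, assuming one POS tag per space-separated word. The function names are hypothetical and this is not necessarily how the repository does it. `jieba.posseg.lcut` is a real jieba call; taking the flag of the first piece when jieba splits a word further, and falling back to `x` for empty input, are assumptions made for this sketch.

```python
from typing import Callable


def jieba_tag_word(word: str) -> str:
    """Return the POS flag jieba assigns to a word (first piece if jieba splits it)."""
    import jieba.posseg as pseg  # imported lazily so the other helpers run without jieba

    pieces = pseg.lcut(word)
    return pieces[0].flag if pieces else "x"


def pos_tag_line(line: str, tag_word: Callable[[str], str]) -> str:
    """Map a space-segmented sentence to a POS sequence with one tag per word."""
    return " ".join(tag_word(w) for w in line.split())


def pos_tag_file(src: str, dst: str,
                 tag_word: Callable[[str], str] = jieba_tag_word) -> None:
    """Write one POS line to dst for every sentence line in src."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(pos_tag_line(line, tag_word) + "\n")
```

With jieba installed, a call such as `pos_tag_file("corpus/XXX/text/train.mono.txt", "corpus/XXX/pos/train.mono.txt")` would fill in the first training path listed above.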

Train Model

Results

Baselines:
- ZH
- EN
- Random
- Noun

Code-switching Point Prediction

- Precision
- Recall
- F-measure
- BLEU-1
- Word Error Rate (WER)
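
As a rough illustration of three of the listed metrics, the sketch below computes precision, recall and F-measure over sets of switched word positions, and WER as word-level edit distance divided by the reference length. How the repository aligns predictions with references is not stated here, so the position-set representation and both function names are assumptions. `word_error_rate` expects a non-empty reference.

```python
def switch_prf(reference, predicted):
    """Precision, recall and F-measure over sets of switched word positions."""
    ref, pred = set(reference), set(predicted)
    tp = len(ref & pred)  # positions switched in both reference and prediction
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


p, r, f = switch_prf(reference=[1, 2, 4, 5], predicted=[1, 2, 3])
wer = word_error_rate("a b c d", "a x c")
```

In this toy call, two of the three predicted positions are correct and two of the four reference positions are found; the hypothesis differs from the reference by one substitution and one deletion.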

Generated Text Quality

Prerequisites

Installation
- srilm
  - installation tutorial on MacOS

- N-gram model
- Recurrent Neural Networks based Language Model (RNNLM)
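
This README does not say which score the language models produce. A common choice is perplexity, and the toy sketch below assumes it. This is not SRILM: it is a plain-Python bigram model with add-one smoothing, written only to show how a language model can score a generated sentence. All names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    """Perplexity of one sentence under an add-one smoothed bigram model."""
    vocab = len(unigrams)
    toks = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(toks, toks[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))


uni, bi = train_bigram(["a b", "a b"])
seen = perplexity("a b", uni, bi)
unseen = perplexity("b a", uni, bi)
```

A sentence whose word order matches the training data gets a lower perplexity than one that does not, which is the sense in which a language model can rank generated text.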

Language Modeling

Automatic Speech Recognition

This is an extended experiment that is not reported in the paper.

Prerequisites

Installation
- kaldi
- srilm
Data
- speech wav files and their text files


ChingtingC/Code-Switching-Sentence-Generation-by-GAN
