Internationalization

We are extending Genie to support multiple languages and are currently working on Chinese. If you'd like to contribute to the internationalization of Genie, please read the following instructions.

Typically, internationalization starts from the original language, English, and translates it into a target language. Here we demonstrate how to build a Traditional Chinese dataset from the English one.

Construct Template

Files you need to modify

genie-toolkit
├── languages
│   ├── en/aggregation.genie    -> zh-tw/aggregation.genie
│   ├── en/constants.genie      -> zh-tw/constants.genie
│   ├── en/filters.genie        -> zh-tw/filters.genie
│   ├── en/parameters.genie     -> zh-tw/parameters.genie
│   ├── en/thingtalk.genie      -> zh-tw/thingtalk.genie
│   └── en/timers.genie         -> zh-tw/timers.genie
└── lib
    └── i18n
        └── american-english.js -> traditional-chinese.js

Add the option for the target language in lib/i18n/index.js.

/languages/*.genie

A brief introduction to each of the files above:

  1. thingtalk.genie: All other *.genie files are included in this file.
  2. aggregation.genie: This file handles aggregation operations in the ThingTalk language, such as MIN, MAX, etc. We define the natural language construction rules for these operations here.
  3. filters.genie: This file handles operations with filters. For example, in 'news with someone in title', the part '...with someone in title' is the filter. We define the natural language construction rules here.
  4. parameters.genie: This file handles how parameters are passed between ThingTalk programs. In ThingTalk, the output parameters of one program can be passed to the next program as its input parameters. We define construction rules here to show how to express parameter passing in natural language.
  5. timers.genie: This file handles cases like 'every day', 'once a week', etc. We define how to synthesize sentences with these phrases.

These construct templates may differ between languages, so do not just copy and translate them. Modify or add construct templates here as your language requires.

For example, in English we can say "Give me something with some conditions". Chinese has no utterance of this form; we would instead say "給我符合某條件的某東西". We therefore remove this template for Chinese and add our own.

/lib/i18n/your_language.js

This file defines post-processing steps that run after sentences are generated. You can copy and modify the existing ones or add more. Do not remove anything that is already there, even if you do not need it; remove only the contents and keep the wrappers.

/lib/i18n/index.js

Finally, add the option for the target language in lib/i18n/index.js.

Dataset Translation

To add a new language, the dataset of the target language is needed.

Step 1. Download a source ThingTalk dataset.

Download the dataset from Thingpedia. It contains the primitive sentences of all devices uploaded to Thingpedia. Datasets in other languages will be available in the future.

genie download-dataset -o dataset.en-US.tt

Step 2. Clean the dataset.

We recommend cleaning the dataset before translating. The downloaded dataset includes not only the utterances to translate but also the preprocessed, id, click_count, and like_count fields. Translating is easier when this extra information is out of the way.

genie dataset -i dataset.en-US.tt -o dataset.raw.en-US.tt -l zh-tw --thingpedia thingpedia.tt --actions clean

Modify the language tag in the first line of the dataset. Note that the language tag here is not the same as the locale. E.g. English is "en"; Chinese is "zh"; etc. The language tag will be recognized by the Almond tokenizer later.

dataset @org.thingpedia.dynamic.everything language "zh" {
  ...
}

Rename the dataset file to dataset.raw.zh-tw.tt.
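
The header edit and rename above can be sketched as a small script. This is only an illustration, not part of Genie: the helper names language_tag and retag_dataset are hypothetical, and taking the part of the locale before the first "-" as the language tag is an assumption based on the "en" and "zh" examples above.

```python
import re


def language_tag(locale: str) -> str:
    """Return the language part of a locale such as "zh-tw" or "en-US".

    Assumption: the language tag is the first subtag of the locale.
    """
    return re.split(r"[-_]", locale, maxsplit=1)[0].lower()


def retag_dataset(text: str, locale: str) -> str:
    """Rewrite the language tag on the first line of a dataset file."""
    first, sep, rest = text.partition("\n")
    first = re.sub(
        r'language\s+"[^"]*"',
        f'language "{language_tag(locale)}"',
        first,
        count=1,
    )
    return first + sep + rest


# Hypothetical file contents, mirroring the header shown above.
source = 'dataset @org.thingpedia.dynamic.everything language "en" {\n  ...\n}\n'
print(retag_dataset(source, "zh-tw"))
```

In practice you would read dataset.raw.en-US.tt, pass its contents through retag_dataset, and write the result to dataset.raw.zh-tw.tt.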

Step 3. Translate all utterances in the dataset manually.

This part requires a lot of hard work. You can translate them all yourself or you might want to hire some part-time workers to do it.

Step 4. Preprocess the translated dataset.

The last step is to preprocess all translated sentences before Genie sentence synthesis. This action unifies the letter case and, for languages such as Chinese, Japanese, and Korean, segments sentences into words with the Almond tokenizer. It adds the preprocessed field, which is essential for generating synthesized sentences, to each command in the translated dataset.
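
To make the effect of this step concrete, here is a rough sketch of what preprocessing produces for a single command. It is not Genie's implementation: the dictionary layout and the preprocess helper are hypothetical, and whitespace splitting stands in for the Almond tokenizer, which is only adequate for space-delimited languages.

```python
def preprocess(utterance, segment=str.split):
    """Lowercase an utterance and join its tokens with single spaces.

    `segment` is a stand-in for the tokenizer; the default splits on whitespace.
    """
    return " ".join(segment(utterance.lower()))


# Hypothetical in-memory form of one translated command.
example = {"utterances": ["Show me  my Tweets"]}
example["preprocessed"] = [preprocess(u) for u in example["utterances"]]
print(example["preprocessed"])
```

For Chinese, Japanese, or Korean, the segment argument would be replaced by a real word segmenter, since those sentences are not separated by spaces.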

The Stanford Almond Tokenizing Service recognizes only English for the time being, so you need to set up a tokenizer locally. See here. Then set the environment variable GENIE_USE_TOKENIZER to local.

export GENIE_USE_TOKENIZER=local

Once the Almond tokenizer is listening on http://localhost:8888, you can preprocess the sentences.

genie dataset -i dataset.raw.zh-tw.tt -o dataset.zh-tw.tt -l zh-tw --thingpedia thingpedia.tt --actions preprocess

The Almond tokenizer does not support Traditional Chinese and performs poorly on it. Our workaround is to convert sentences from Traditional Chinese to Simplified Chinese first, segment them, and convert them back.
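
The workaround can be sketched as follows. Everything here is an illustrative assumption: T2S is a tiny hypothetical character table (a real setup would use a full conversion library), and segment_simplified is a toy stand-in for the tokenizer. The sketch also swaps the final "convert back" step for slicing the original Traditional string by token length. That relies on the conversion being character-for-character, and it sidesteps the ambiguity of mapping Simplified characters back to Traditional ones.

```python
# Hypothetical, tiny Traditional -> Simplified table for this example only.
T2S = {"給": "给", "東": "东", "條": "条"}


def to_simplified(text):
    """Convert character by character, so the length of the text is preserved."""
    return "".join(T2S.get(ch, ch) for ch in text)


def segment_simplified(text):
    """Stand-in for the tokenizer: greedy longest match over a toy lexicon."""
    lexicon = {"给", "我", "符合", "某", "条件", "的", "东西"}
    tokens, i = [], 0
    while i < len(text):
        for size in (2, 1):
            piece = text[i:i + size]
            # Unknown single characters become their own token.
            if piece in lexicon or size == 1:
                tokens.append(piece)
                i += len(piece)
                break
    return tokens


def segment_traditional(text):
    """Segment Traditional Chinese using token boundaries found on the Simplified form."""
    out, pos = [], 0
    for tok in segment_simplified(to_simplified(text)):
        out.append(text[pos:pos + len(tok)])
        pos += len(tok)
    return out


print(segment_traditional("給我符合某條件的某東西"))
```

Because the output tokens are slices of the input, joining them always reproduces the original sentence exactly.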

Step 5. Synthesize sentences with Genie.

Then you can start to synthesize sentences from your translated dataset!

genie generate --locale zh-tw --template languages/zh-tw/thingtalk.genie \
  --thingpedia thingpedia.tt --entities entities.json --dataset dataset.zh-tw.tt \
  -o synthetic.zh-tw.tsv