spelled words / names #85

nicolaspanel · 2019-02-16T11:06:48Z

As described in common-voice/sentence-collector#169, a common issue is that some words are spelled out (i.e. read letter by letter).

Example: PHP => P H P

What should be the output of CorporaCreator in such situations?

kdavis-mozilla · 2019-02-16T16:53:30Z

The easy answer is to remove such sentences, as abbreviations should not have been included in the texts to read. However, there is also a longer, more painful answer...

If you wanted to keep all readings of this sentence, then an option might be to listen to each one. (I don't know how many there are as I haven't looked.) If a person says it "correctly" in French, which I assume is the language you are concerned with, then you can just leave it as is: "PHP".

However, some people might have said it "incorrectly" in French. (I'm not sure how it is said in French, but I'll assume it's like in English with a French accent.) For these people you'd have to transform the text into a transcript of what they actually said.

The French preprocessor fr.py, and all other language-specific preprocessors, allow you to do this per-user transformation of a transcript, as each is passed both the transcript, in sentence, and the ID of the user who said the sentence, in client_id.

The design of the language specific preprocessor was made for just this use case, where text "Room 246" could be validly read in different ways, e.g. "Room two four six" or "Room two hundred forty six", by different people and the text would have to be fixed on a user by user basis.
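A minimal sketch of what such a per-user fix could look like, assuming the preprocessor is a function that receives client_id and sentence and returns the corrected transcript. The function name, the override table, and the client IDs below are hypothetical and only illustrate the idea; they are not taken from CorporaCreator itself.

```python
# Hypothetical table of per-user corrections, keyed by
# (client_id, original text). The client IDs are made up for illustration;
# real ones would come from listening to each recording.
PER_USER_OVERRIDES = {
    ("client-a", "Room 246"): "Room two four six",
    ("client-b", "Room 246"): "Room two hundred forty six",
    ("client-c", "PHP"): "P H P",
}


def preprocess(client_id, sentence):
    """Return the transcript matching what this speaker actually said.

    Assumed interface: takes the speaker's client_id and the original
    sentence. Falls back to the unchanged sentence when no override
    has been recorded for this speaker.
    """
    return PER_USER_OVERRIDES.get((client_id, sentence), sentence)
```

With this table, two speakers reading the same text end up with different transcripts, while a speaker with no recorded override keeps the original sentence.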
