spelled words / names #85

Open
nicolaspanel opened this issue Feb 16, 2019 · 1 comment

Comments

@nicolaspanel

As described in common-voice/sentence-collector#169, a common issue is that some words are spelled out (i.e., letter by letter).

Example: PHP => P H P

What should the output of CorporaCreator be in such situations?

@kdavis-mozilla
Contributor

The easy answer is to remove such sentences, as abbreviations should not have been included in the texts to read. However, there is also a longer, more painful answer...
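As a rough illustration of the easy answer, here is a minimal sketch that drops sentences containing an abbreviation. The regular expression and the function name `drop_abbreviation_sentences` are assumptions made for this example, not part of CorporaCreator, and the heuristic (two or more consecutive ASCII capitals standing alone) will miss some abbreviations and catch some ordinary all-caps words.

```python
import re

# Heuristic (an assumption, not CorporaCreator's rule): a standalone run of
# two or more ASCII capital letters is treated as an abbreviation.
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def drop_abbreviation_sentences(sentences):
    """Return only the sentences that contain no all-caps abbreviation."""
    return [s for s in sentences if not ABBREVIATION.search(s)]


kept = drop_abbreviation_sentences(["Je code en PHP.", "Bonjour tout le monde."])
print(kept)
```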

If you wanted to keep all readings of this sentence, then an option might be to listen to each one. (I don't know how many there are, as I haven't looked.) If a person says it "correctly" in French, which I assume is the language you are concerned with, then you can just leave it as is: "PHP".

However, some people might have said it "incorrectly" in French. (I'm not sure how it is said in French, but I'll assume it's like in English with a French accent.) For these people you'd have to transform the text into a transcript of what they actually said.

The French preprocessor fr.py, like all other language-specific preprocessors, allows you to do this per-user transformation of a transcript, as it is passed both the transcript in sentence and the ID of the user who said the sentence in client_id.

The language-specific preprocessors were designed for just this use case, where a text such as "Room 246" could be validly read in different ways, e.g. "Room two four six" or "Room two hundred forty six", by different people, and the text would have to be fixed on a user-by-user basis.
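A per-user fix of this kind could look roughly like the sketch below. It only assumes what is stated above, namely that the preprocessor receives `client_id` and `sentence`. The function name `preprocess_sketch`, the lookup table, and the client IDs are hypothetical; the table would have to be filled in by hand after listening to each recording.

```python
# Hypothetical table built by listening to the clips:
# (client_id, original sentence) -> what that user actually said.
PER_USER_TRANSCRIPTS = {
    ("client-a", "Room 246"): "Room two four six",
    ("client-b", "Room 246"): "Room two hundred forty six",
}


def preprocess_sketch(client_id, sentence):
    """Return the transcript for this user, or the sentence unchanged."""
    return PER_USER_TRANSCRIPTS.get((client_id, sentence), sentence)


print(preprocess_sketch("client-a", "Room 246"))
print(preprocess_sketch("client-z", "Room 246"))
```

Users with no entry in the table get the original sentence back, so only the readings that were checked by hand are changed.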


2 participants