Check for numbers in sentences can be switched off #100

Open: g3n35i5 wants to merge 1 commit into base: master

@g3n35i5 g3n35i5 commented Jul 18, 2019

If the CorporaCreator is used with data in which it is valid that
sentences contain numbers, there should be a way to allow them.

With the optional command line parameter "-c" this check can now be
skipped.

Usage:

create-corpora [other args] -c {true, false, t, f, 0, 1, y, n, yes, no}
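
A minimal sketch of how such a boolean option could be parsed with `argparse`. This is not taken from the commit itself: the helper `str2bool`, the destination name `check_numbers`, the default of `True`, and the reading that a true value means "the check runs" are all assumptions made for illustration.

```python
import argparse


def str2bool(value):
    """Map the spellings listed in the usage line to a bool (hypothetical helper)."""
    truthy = {"true", "t", "1", "y", "yes"}
    falsy = {"false", "f", "0", "n", "no"}
    lowered = value.strip().lower()
    if lowered in truthy:
        return True
    if lowered in falsy:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="create-corpora")
    # "-c" comes from the PR description; dest and default are assumptions.
    parser.add_argument(
        "-c",
        dest="check_numbers",
        type=str2bool,
        default=True,
        help="whether sentences containing numbers are rejected (assumed meaning)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-c", "no"])
    print(args.check_numbers)
```

Keeping the default at `True` in this sketch means the existing behaviour is unchanged unless the option is passed explicitly.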

@kdavis-mozilla
Contributor

> If the CorporaCreator is used with data in which it is valid that
> sentences contain numbers, there should be a way to allow them.

When is it valid to have numbers?

@g3n35i5
Author

g3n35i5 commented Jul 18, 2019

I am currently working with a phonetic transcription of the CommonVoice data set. Most speech recognition tools require each symbol to be exactly one character long, which is why I encoded multi-character symbols such as "a:" as digits and listed those digits in the alphabet.
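
The encoding described here could look roughly like the sketch below. The mapping table is entirely hypothetical; the actual symbols and digits used in the preprocessing scripts are not given in this thread.

```python
# Hypothetical mapping from multi-character phonetic symbols to single digits.
SYMBOL_TO_DIGIT = {"a:": "1", "e:": "2", "o:": "3"}


def encode(transcription):
    """Replace each multi-character symbol with its one-character digit."""
    # Longer symbols are replaced first so a shorter one never splits a longer one.
    for symbol in sorted(SYMBOL_TO_DIGIT, key=len, reverse=True):
        transcription = transcription.replace(symbol, SYMBOL_TO_DIGIT[symbol])
    return transcription


def decode(encoded):
    """Restore the original symbols from the digit encoding."""
    digit_to_symbol = {digit: symbol for symbol, digit in SYMBOL_TO_DIGIT.items()}
    return "".join(digit_to_symbol.get(ch, ch) for ch in encoded)


if __name__ == "__main__":
    print(encode("ba:t"))
    print(decode(encode("ba:t")))
```

With an encoding like this, every sentence contains digits by design, which is why a number check would reject the whole data set.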

@kdavis-mozilla
Contributor

What about unicode?

@g3n35i5
Author

g3n35i5 commented Jul 18, 2019

The phonetic characters are already Unicode characters. However, phonetic transcription includes symbols that are composed of several characters. Of course I could simply replace the numbers with other characters, but I chose this representation in my preprocessing scripts and have already filtered out invalid sentences in advance.

If the CorporaCreator is only meant to process orthographic sentences, this feature does not have to be merged. But if you want to be able to process other forms of data, I think this should be a feature.

@kdavis-mozilla
Contributor

Ugh. For some reason I thought things like "kː" were single code points.
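
This can be checked quickly with the standard `unicodedata` module. The helper name `code_points` is made up for this sketch:

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


symbol = "k\u02d0"  # "k" followed by the IPA length mark

if __name__ == "__main__":
    print(len(symbol))
    for point, name in code_points(symbol):
        print(point, name)
    # NFC normalization does not merge the two characters into one.
    print(len(unicodedata.normalize("NFC", symbol)))
```

The length mark is a separate modifier letter, not a combining mark with a precomposed form, so normalization leaves the symbol two code points long.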

Generally, however, the CorporaCreator is designed to process orthographic sentences.

So beyond your use case I'm not sure the suggested command line options would be of use. And
honestly, I worry that the feature would be abused by people who don't have a valid use case, like
yours, to pollute the corpora with numbers.

@g3n35i5
Author

g3n35i5 commented Jul 18, 2019

> So beyond your use case I'm not sure the suggested command line options would be of use.

You might want to issue a warning telling users that they should only use this feature if they are sure of what they are doing. After all, it is an optional parameter.

> I worry that the feature would be abused by people who don't have a valid use case, like
> yours, to pollute the corpora with numbers.

However, if you have any concerns, I can fully understand that. If you want to close the PR, you are welcome to do so.
