Check for numbers in sentences can be switched off #100

g3n35i5 · 2019-07-18T08:00:08Z

If the CorporaCreator is used with data in which it is valid that
sentences contain numbers, there should be a way to allow them.

With the optional command line parameter "-c" this check can now be
skipped.

Usage:

create-corpora [other args] -c {true, false, t, f, 0, 1, y, n, yes, no}
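
For illustration, a flag that accepts these spellings could be parsed roughly as follows. This is a minimal sketch, not the code in this PR: the helper name `str2bool`, the long option `--check-numbers`, the default value, and the reading that a true value keeps the check enabled are all assumptions.

```python
import argparse

# Accepted spellings, taken from the usage line above.
TRUE_VALUES = {"true", "t", "1", "y", "yes"}
FALSE_VALUES = {"false", "f", "0", "n", "no"}


def str2bool(value):
    """Map an accepted spelling to a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise argparse.ArgumentTypeError("expected a boolean value, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser(prog="create-corpora")
    # Assumption: True keeps the number check on, False skips it.
    parser.add_argument(
        "-c",
        "--check-numbers",
        type=str2bool,
        default=True,
        help="whether sentences containing digits are rejected",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["-c", "no"]).check_numbers` is `False`, and omitting `-c` leaves the check enabled.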


kdavis-mozilla · 2019-07-18T08:32:51Z

> If the CorporaCreator is used with data in which it is valid that
> sentences contain numbers, there should be a way to allow them.

When is it valid to have numbers?

g3n35i5 · 2019-07-18T08:39:26Z

I am currently working with a phonetic transcription of the CommonVoice data set. For most SpeechRecognition tools, each symbol can only be one character long, which is why, for example, I coded characters like "a:" with numbers and specified them in the alphabet.
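
A minimal sketch of this kind of encoding, to make the idea concrete. The symbol table is made up for illustration (the actual mapping is not given in this thread), and the sketch assumes the original transcription contains no digits of its own.

```python
# Hypothetical table: multi-character phonetic symbols -> single digits.
SYMBOL_TO_DIGIT = {"a:": "1", "e:": "2", "o:": "3"}


def encode(transcription):
    """Replace each multi-character symbol with its one-character digit."""
    # Replace longer symbols first so overlapping symbols are not split.
    for symbol in sorted(SYMBOL_TO_DIGIT, key=len, reverse=True):
        transcription = transcription.replace(symbol, SYMBOL_TO_DIGIT[symbol])
    return transcription


def decode(encoded):
    """Invert encode(), assuming the input had no digits before encoding."""
    digit_to_symbol = {digit: symbol for symbol, digit in SYMBOL_TO_DIGIT.items()}
    return "".join(digit_to_symbol.get(ch, ch) for ch in encoded)
```

After encoding, every unit in the alphabet is exactly one character long, which is why the resulting sentences contain digits.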

kdavis-mozilla · 2019-07-18T08:43:02Z

What about unicode?

g3n35i5 · 2019-07-18T09:20:01Z

The phonetic characters are already unicode characters. However, in phonetic transcription there are symbols that are composed of several characters. Of course I could simply replace the numbers with other characters, but I chose this representation in my preprocessing scripts, which also sort out invalid sentences in advance.

If the CorporaCreator should only be there to process orthographic sentences, this feature does not have to be merged. But if you want to have the possibility to process data of other forms, I think this should be a feature.

kdavis-mozilla · 2019-07-18T09:41:31Z

Ugh. For some reason I thought things like "kː" were single code points.

Generally, however, the CorporaCreator is designed to be used to process orthographic sentences.

So beyond your use case I'm not sure the suggested command line options would be of use. And
honestly, I worry that the feature would be abused by people who don't have a valid use case, like
yours, to pollute the corpora with numbers.
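
On the code point question, a quick check with Python's standard `unicodedata` module shows that "kː" is two code points and that NFC normalization does not merge them. The function name `code_points` is only an illustrative helper.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) for every character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


symbol = "k\u02d0"  # "kː": the letter k followed by the IPA length mark

print(len(symbol))  # 2
for code, name in code_points(symbol):
    print(code, name)
# U+006B LATIN SMALL LETTER K
# U+02D0 MODIFIER LETTER TRIANGULAR COLON

# No precomposed form exists, so normalization leaves two code points.
print(len(unicodedata.normalize("NFC", symbol)))  # 2
```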

g3n35i5 · 2019-07-18T09:46:13Z

> So beyond your use case I'm not sure the suggested command line options would be of use.

You might want to issue a warning to the user that he should only use this feature if he is sure what he is doing. After all, it is an optional parameter.

> I worry that the feature would be abused by people who don't have a valid use case, like
> yours, to pollute the corpora with numbers.

However, if you have any concerns, I can fully understand that. If you want to close the PR, you are welcome to do so.
