
setup basic FR preprocessing #87

Open. Wants to merge 13 commits into base: master.

Conversation

nicolaspanel

No description provided.

@kdavis-mozilla (Contributor) left a comment

Thanks for the commit!

However, before I review the code proper, could you change all the packaging code to adhere to the conventions of this repo? In other words, use setup.cfg as described here, for example.

(Resolved review threads: requirements.txt, README.rst.)
@nicolaspanel (Author):

Thanks @kdavis-mozilla for the review. I've made the requested changes.


from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
Comment:

@nicolaspanel Maybe we should include potential spaces? I saw data like "1 er".
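As a side note, the pattern as written uses square brackets (a character class) where a grouped alternation was probably intended. A standalone sketch that uses grouping and also tolerates an optional space, as in "1 er" (the extra bare "er"/"ere" variants in the suffix list are an assumption, not part of the PR):

```python
import re

# Grouped alternation instead of a character class, with an optional space
# between the digits and the ordinal suffix. The "er"/"ere" variants are
# assumed additions so that "1 er" matches.
FIND_ORDINAL_REG = re.compile(r"(\d+)\s?(ème|éme|ieme|ier|iere|er|ere)")

print(FIND_ORDINAL_REG.search("le 1 er janvier").groups())  # ('1', 'er')
print(FIND_ORDINAL_REG.search("le 3ème jour").groups())     # ('3', 'ème')
```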

Author:

@lissyx since there is no such case in clips.tsv, I suggest we wait.

Author:

@lissyx OK for you?

Comment:

Yep.

(Resolved review threads: src/corporacreator/preprocessors/fr.py.)
@kdavis-mozilla (Contributor) left a comment

There are several issues, see the comments, that I have questions on and/or that need to be addressed.

(Resolved review threads: setup.cfg, src/corporacreator/preprocessors/fr.py.)
@@ -8,5 +32,9 @@ def fr(client_id, sentence):
Returns:
(str): Cleaned up sentence. Returning None or a `str` of whitespace flags the sentence as invalid.
"""
# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS)
Contributor:

Do these always make sense? (See comments above on FR_NORMALIZATIONS.)

Author:

I think so.
If not, then special cases should be handled using client_id.
@kdavis-mozilla can you think of an example?



FR_NORMALIZATIONS = [
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
Contributor:

We ideally want to not have digits. That said, I'm not sure I understand the motivation for this change.

For example, "123 000" might have been read "one hundred twenty three zero zero zero". However, now it's changed to "123000", which I doubt would be read as "one hundred twenty three zero zero zero". So we'd introduce a mismatch between the audio and the text.

Are you assuming that replace_numbers() fixes this?

If so, how can replace_numbers() do this accurately, as it does not know about splits like "123 000"? All it would see is "123000", which, due to the split, may have been pronounced "one hundred twenty three zero zero zero".
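For reference, the rule under discussion can be exercised on its own; a quick standalone sketch (outside the PR's code):

```python
import re

# The normalization rule being discussed, copied from the PR: collapse a
# thousands-separator space so "123 000" becomes "123000".
pattern = re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)')
repl = r'\1\2\3\4'

print(pattern.sub(repl, "environ 123 000 habitants"))  # environ 123000 habitants
print(pattern.sub(repl, "123 456 habitants"))          # unchanged: second group is not "000"
```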

Comment:

Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.

Author:

> Are you assuming that replace_numbers() fixes this?

Yes.

> For example "123 000" might have been read "one hundred twenty three zero zero zero".

I assume that is not the case and the user said "cent vingt trois mille".

Author:

> Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.

Thanks @lissyx.
It seems that some cases were still not properly handled. For example, clips.tsv contains sentences like:

  • les trois,000 valeur du trésor de Loretto
  • à ma fille, et dix.000 fr.
  • Loretto contenait un trésor à peu près de trois,000 liv.

Comment:

@nicolaspanel Yep, those are actually part of another dataset that was much less well formatted, and some errors slipped through :/

Contributor:

I know this is not the perfect place to discuss this, but...

I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.

It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages, and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.

I'd be interested in your opinions.
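The proposed rule is simple to state; a hypothetical sketch (this is not the actual common.py code):

```python
# Hypothetical sketch of the proposed rule: any sentence containing a digit
# is treated as invalid. Note that str.isdigit() also flags superscripts
# like "²", which would catch un-normalized "m²" as well.
def is_valid(sentence):
    """Return False for sentences containing any digit character."""
    return not any(ch.isdigit() for ch in sentence)

print(is_valid("les trois,000 valeur du trésor"))  # False
print(is_valid("Jean-Paul deux."))                 # True
```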

Comment:

@kdavis-mozilla @nicolaspanel

$ cut -f3 source/fr/validated.tsv | grep '[[:digit:]]' | wc -l
366

I guess we can skip numbers for now, fix it in the CV dataset, and hand-craft the current recordings; 366 is not impossible.

Contributor:

Looks like there are a few leftover digits being searched for here.

(Resolved review threads: src/corporacreator/preprocessors/fr.py, src/corporacreator/utils.py.)
try:
ee = ''.join(e.split())
if int(e) >= 0:
newinp = num2words(int(ee), lang=locale)
Contributor:

How can this work in all cases?

For example, "Room 456" can validly be read as "Room four five six" or as "Room four hundred and fifty six". This code can't catch that.

It is for reasons exactly like this that the client_id is passed to fr(), so you can hear what the person said and provide the correct transcript.

Author:

Here we assume the value is not ambiguous. Situations like "Room four five six" should have already been handled by the maybe_normalize step to produce "Room 4 5 6" instead of the original "Room 456".
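The exact implementation of maybe_normalize is not shown in this thread; a minimal version consistent with how it is called (a list of [pattern, replacement] pairs applied in order) might look like this, with a hypothetical mapping for the "Room 456" case:

```python
import re

def maybe_normalize(value, mapping):
    """Apply each [pattern, replacement] pair in order; patterns may be
    plain strings or compiled regexes. Sketch only -- the real utils.py
    implementation is not shown in this thread."""
    for pattern, replacement in mapping:
        if isinstance(pattern, str):
            value = value.replace(pattern, replacement)
        else:
            value = pattern.sub(replacement, value)
    return value

# Hypothetical mapping that splits a three-digit run so it reads digit by digit:
mapping = [[re.compile(r'Room (\d)(\d)(\d)'), r'Room \1 \2 \3']]
print(maybe_normalize("Room 456", mapping))  # Room 4 5 6
```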

Author:

@kdavis-mozilla is it OK for you?

Contributor:

Are we allowed to assume we are in a non-ambiguous case?

I don't see how we can assume such without hearing the audio.

Comment:

@kdavis-mozilla I think in French we should be fine regarding ambiguous cases, except for numbers > 1099 and <= 9999. Those might be (and, in the case of dates, often are) spelled by hundreds. But as I said to @nicolaspanel, if it's too much work for him and the edge cases are too tricky to risk polluting the dataset, then I can dig into the clips and listen, later.

(Resolved review thread: src/corporacreator/utils.py.)
@nicolaspanel (Author) commented Feb 20, 2019:

@lissyx @kdavis-mozilla
I won't have time to handle all the specific situations.
The point of this PR was to share some basic normalisation code I have.
Let's focus then on the remaining issues (uppercase, punctuation, etc.). It will still be possible to improve later.

@nicolaspanel (Author):

@kdavis-mozilla @lissyx I think I've made all the requested changes.

@lissyx left a comment

Not sure about the numbers thing.


from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
Comment:

Yep.

# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
Comment:

Just wondering if we should do that now or later: as you've shown, my heuristics were not perfect, so maybe it'd be best if I listened to the recordings, adjusted with client_id, and fixed the Common Voice dataset at the same time?

Author:

As far as I can tell, it works just fine as is (I am also using it in the trainingspeech project).
It is a good idea to pick and listen to a few examples as a check, but checking all the examples would take a lot of time...
Personally, I won't have that kind of bandwidth anytime soon...

Comment:

That's why I was suggesting that I do it :)

Author:

@lissyx why not do it in another PR?

Comment:

@nicolaspanel That's what was implied :)

Contributor:

@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.

I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.

Comment:

@kdavis-mozilla I guess it's not such a bad idea, with a logging mode to help identify and fix the dataset as well.

@@ -0,0 +1,34 @@
import pytest
Comment:

👍

@kdavis-mozilla (Contributor) left a comment

I've added a few comments on the code, but more than anything I wanted to ask everyone following this issue about the Draconian idea of having a separate PR that has common.py mark as invalid any sentences with digits in them.

The most complicated part of this PR, and of other PRs in other languages, is the digits. So my thought was to just solve the problem once and for all in all languages: throw out any sentence with digits.

That said, if a separate PR is made to have common.py mark as invalid any sentences with digits in them, then a lot of the code here is not needed.

@lissyx it'd be great to have your feedback too!


FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")

SPELLED_ACRONYMS = {
Contributor:

If this contains all the acronyms in fr for the current clips.tsv, then we're fine.

]


FR_NORMALIZATIONS = [
Contributor:

Fine here too.






@kdavis-mozilla (Contributor) commented Feb 21, 2019:

@lissyx @nicolaspanel Sorry for my review comment being more about starting a discussion, but I think it's a discussion that needs to happen, as complicated digit manipulation in many languages isn't going to happen on a timescale that's useful for a data release. (@nicolaspanel this is not a reflection on you, as you are the only person who's taken a real shot at doing the digits right!)

@kdavis-mozilla (Contributor):

@nicolaspanel @lissyx I am going to introduce a PR later today to mark as invalid any sentence in any language with digits, see issue 89.

As the release date looms and the number of languages that deal with digits properly is almost zero, I talked with other members of the project to establish that this is the most practical way forward to maintain a high-quality data set with a release in the foreseeable future.

As a note, this would increase the number of invalid French sentences by 0.52%, which I think is acceptable.

@lissyx left a comment

LGTM now, let's not lose time on numbers.

@kdavis-mozilla (Contributor):

@nicolaspanel I just merged code that marks all sentences in all languages that have digits as invalid. For French this decreases the number of valid sentences by 0.52%, which I think is acceptable.

So could you remove all code in your PR that deals with digits?

@nicolaspanel (Author):

> So could you remove all code in your PR that deals with digits?

@kdavis-mozilla done.

I marked the related unit tests as skipped, since we may want to support them later.
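Marking the digit tests as skipped while keeping them in the file could look like the following pytest sketch (the sentences and function name are illustrative, not the PR's actual test file):

```python
import pytest

# Hypothetical illustration: digit-handling cases are kept but skipped via
# pytest.param marks, while digit-free cases still run.
@pytest.mark.parametrize('sentence, expected', [
    pytest.param("donc, ce sera 299 €",
                 "donc, ce sera deux cent quatre-vingt-dix-neuf euros",
                 marks=pytest.mark.skip(reason="sentences with digits are currently invalid")),
    ("Jean-Paul II.", "Jean-Paul deux."),
])
def test_fr_preprocessing(sentence, expected):
    # The real test would call preprocessors.fr(client_id, sentence) and
    # compare against expected; here we only sketch the shape.
    assert isinstance(expected, str)
```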

@kdavis-mozilla (Contributor) left a comment

Thanks for removing most of the digits code, but it seems like there are still digit relics in the regular expressions that should be removed.

I think I commented on all of them with "Looks like there are a few leftover digits being searched for here.", but I might have missed one or two.


FR_NORMALIZATIONS = [
['Jean-Paul II', 'Jean-Paul deux'],
[re.compile(r'(^|\s)(\d+)T(\s|\.|,|\?|!|$)'), r'\1\2 tonnes\3'],
Contributor:

Looks like there are a few leftover digits being searched for here.



FR_NORMALIZATIONS = [
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
Contributor:

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(^|\s)/an(\s|\.|,|\?|!|$)'), r'\1par an\2'],
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
[re.compile(r'(^|\s)km(\s|\.|,|\?|!|$)'), r'\1 kilomètres \2'],
[re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
Contributor:

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
['%', ' pourcent'],
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
Contributor:

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
[re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
[re.compile(r'(^|\s)(\d+),(\d{2})\s?€(\s|\.|,|\?|!|$)'), r'\1\2 euros \3 \4'],
Contributor:

Looks like there are a few leftover digits being searched for here.



FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')
FIND_PUNCTUATIONS_REG = re.compile(r"[/°\-,;!?.()\[\]*…—«»]")
Contributor:

In parallel with the removal of the commented-out code above, I'd ask that this also be removed, as it's not used.

text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
# TODO: restore this once we are clear on which punctuation marks should be kept or removed
# text = FIND_PUNCTUATIONS_REG.sub(' ', text)
text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
Contributor:

This has no effect, as multiple white spaces are already removed here. So it seems like it should be removed.
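For reference, the substitution being flagged as dead code does the following on its own; whether an equivalent collapse already happens upstream in common.py is what makes it redundant here:

```python
import re

# The pattern from utils.py under discussion: collapse runs of two or more
# whitespace characters into a single space.
FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')

print(FIND_MULTIPLE_SPACES_REG.sub(' ', "Jean-Paul   deux.  "))  # "Jean-Paul deux. "
```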

from corporacreator import preprocessors


@pytest.mark.parametrize('locale, client_id, sentence, expected', [
Contributor:

I'd suggest removing tests that no longer make sense in light of digits being banned. For example:

('fr', '*', "donc, ce sera 299 € + 99 €", "donc, ce sera deux cent quatre-vingt-dix-neuf euros plus quatre-vingt-dix-neuf euros"),

Some of the tests here, for example

('fr', '*', "Jean-Paul II.", "Jean-Paul deux.")

have nothing to do with digits and can actually be run independently of this comment.

['%', ' pourcent'],
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
Contributor:

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
[re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
Contributor:

Looks like there are a few leftover digits being searched for here.

@nicolaspanel (Author):

> Looks like there are a few leftover digits being searched for here.

@kdavis-mozilla you're right, sorry I missed it. Fixed in 149e960.

@kdavis-mozilla (Contributor) left a comment

Assuming my eyes are parsing the regexes correctly, it looks like there are still some regexes that deal with digits.

For example

 [re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2']

Also there is some "dead code"

text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)

that's "dead" as a result of code in common.py.
