[fr] Ignore words starting with capital letter in "VERBE_SUIVI_D_UN_NOM" ruleset #10422

Sharcoux · 2024-03-19T15:21:43Z

Fixup for VERBE_SUIVI_D_UN_NOM based on the last diff.

We will fix 2 things:

Words starting with a capital letter should be ignored
Words that can be both a noun and something else should be ignored

I noticed that the rule is still triggered when there is a newline character between the 2 tokens. I don't know how to ignore that. For now, the check about capital letters should do the trick, but that's not ideal.

Sharcoux · 2024-03-21T13:24:09Z

The disambiguation processor seems to have many flaws with this kind of structure. Anyway, this rule will just ignore the mistake when the disambiguation is wrong, and once the processor improves, it will correctly catch them all.

For the record, an example of disambiguation problem:

"Il prend café"

Token	Lemma	Part-of-speech
Il	il	R pers suj 3 m s
prend	prendre	V ind pres 3 s
café	café	J e sp / N m s

There is no way that café could be an adjective here. There is not even another noun in the sentence for it to modify...

"Il prend pelle"

Token	Lemma	Part-of-speech
Il	il	R pers suj 3 m s
prend	prendre	V ind pres 3 s
pelle	pelle / peller	N f s / V imp pres 2 s / V ind pres 1 s / V ind pres 3 s / V sub pres 1 s / V sub pres 3 s

How can a conjugated verb follow a conjugated verb that is not an auxiliary? What could be the subject of "V sub pres 1 s"? That doesn't seem to make any sense.

Sharcoux · 2024-03-21T13:28:12Z

This is the continuation of PR #10385.

I'll keep tracking down the diffs but we should be close to the end.

Sharcoux · 2024-03-21T13:30:39Z

@jaumeortola This is ready for merge

Sharcoux · 2024-04-08T08:53:03Z

@jaumeortola @LucieSteib Can we merge this, so that I can check whether every case is 100% handled?

LucieSteib · 2024-04-08T09:24:50Z

languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml

@@ -116344,7 +116351,7 @@ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
                <suggestion>un \2</suggestion>
                <suggestion>une \2</suggestion>
                <suggestion>des \2</suggestion>
-                <example correction="un rhume|une rhume|des rhume">J'attrape <marker>rhume</marker>.</example>
+                <example correction="un lapin|une lapin|des lapin">J'attrape <marker>lapin</marker>.</example>


LucieSteib: question: I still don't get why it's not possible to get the right suggestion here (feminine or plural forms...)

Sharcoux: I would love that if it were possible. Do you know how? I thought that LanguageTool had no solution for that.

Sharcoux: Basically, we have a verb followed by a noun. The easiest and one of the likeliest fixes is to add the missing article. But we need to know the gender and number of the noun in order to generate the correct article, and I don't think that suggestions can be made conditional on the gender or number of that noun. At least I found nothing of the sort in the documentation, and ChatGPT didn't know of a way either...

LucieSteib: Well, to be fair, the model is kinda good at doing this. Maybe you could check whether some of the examples here are not detected by the model?

Sharcoux · 2024-04-08T14:23:56Z

Ok, the original problem I was trying to solve was with the sentence "L'eau entre par la bouche du poisson et ressort par les opercules." Also, I notice that "Il prend pelle." and "Il prend café." won't be flagged as wrong. However, it's true that many mistakes seem to be correctly identified by the rule "AI_FR_GGEC_MISSING_DETERMINER". But what is this rule exactly, and where does it come from? Are you sure that it is available in this project? I couldn't find anything about it in the repo.

LucieSteib · 2024-04-08T15:31:34Z

And indeed you are right, neither of these sentences gets corrected properly yet:
"L'eau entre par la bouche du poisson et ressort par les opercules." is still a false positive of CONFUSION_ER_E_PAR[20] (which could be fixed with a simple antipattern in the corresponding subrule).
"Il prend pelle." is indeed still a false negative; I can add it to the next model training.
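
For illustration, such an antipattern might look like the sketch below. It assumes that the false positive is triggered on "entre par", which this thread does not confirm; the token choice is a guess and would need to be checked against the actual CONFUSION_ER_E_PAR[20] match.

```xml
<!-- Hypothetical sketch: an antipattern placed inside the relevant
     CONFUSION_ER_E_PAR subrule, before its <pattern> element.
     The tokens below are an assumption, not taken from the real rule. -->
<antipattern>
    <token>entre</token>
    <token>par</token>
</antipattern>
```

An antipattern this broad would also silence true errors on the same two words, so in practice it would probably need more context, such as the preceding subject.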

The "rules" you see starting with AI_FR_GGEC aren't XML rules like the other ones in the grammar.xml file.
They are operations performed by our Artificial Intelligence (for) Grammatical Error Correction for FRench.
These parts of LanguageTool are not open source, which is why you can't find them "in the project". They work in an entirely different way, and nothing about them lives in this repository.

Sharcoux · 2024-04-08T16:02:10Z

The original rule CONFUSION_ER_E_PAR is wrong and is supposed to be replaced with CONFUSION_ER_E_PAR2. But you're right, this is not related to this PR and we can discuss it in the other one.

About the fixes made by the AI, I understand that this part might not be available in the open source project, but in this case, that means that there is no rule for those in the current project and thus, this PR is important.

LucieSteib · 2024-04-09T08:02:05Z

> About the fixes made by the AI, I understand that this part might not be available in the open source project, but in this case, that means that there is no rule for those in the current project and thus, this PR is important.

The model's operations are available to non-Premium users, in the Editor and the web extension.
Only the code supporting it is not accessible. It's a massively trained mT5 model that cannot be changed like XML rules, which is why it's not "in the OS repository".

LucieSteib · 2024-04-09T08:06:27Z

I remember that you want to replace CONFUSION_ER_E_PAR with CONFUSION_ER_E_PAR2 entirely, yes.
Will version 2 remove FPs and add TPs compared to the current CONFUSION_ER_E_PAR?
You should be able to see that in the diffs, through thorough testing and by assessing the false positives (if the new version generates a ton of false positives, it's not really an improvement).
Note, though, that the diffs have not been generated properly for a few days, so we had better wait for that issue to be fixed before merging this PR anyway.

Sharcoux · 2024-04-09T09:14:20Z

@LucieSteib About CONFUSION_ER_E_PAR2, I'm confident I can reach a fairly good result, and I promise to analyze the diffs very carefully until I get something satisfying. However, I had to give up some detections to work around errors made in the disambiguation process, as mentioned here. In the future, I might try to improve the disambiguation to achieve even better detection. But the current version should already bring quite an improvement.

I'm not sure how to replace one rule with the other and see the diff. I only know how to see the diffs introduced by new rules being added. Maybe, when the rule is ready, I should open a PR that does the replacement?

> The model's operations are available to non-Premium users, in the Editor and the web extension.
> Only the code supporting it is not accessible. It's a massively trained mT5 model that cannot be changed like XML rules, which is why it's not "in the OS repository".

I don't think that this is available for self-hosted instances, is it? We use a self-hosted instance for offline usage. If it's not available, then I still insist that we need a rule in the project to cover those problems. I have no problem trying to improve the rule, though, if you don't like the way it is now.

LucieSteib · 2024-04-09T09:37:50Z

No, you're right, the model would be far too heavy for a self-hosted instance, it's not accessible like that.

About the gender/number agreement of the determiner in VERBE_SUIVI_D_UN_NOM: maybe look at D_N_E_OU_E[2], which corrects "Je prends un pelle" to "une pelle":

<pattern>
    <token postag="D m s">
        <exception scope="previous">qui</exception>
        <exception regexp="yes">(?-i)CE|audit|tout</exception></token>
    <token regexp="yes" min="0" max="1">très|trop|vraiment</token>
    <token postag="N f s" regexp="yes">.*e$
        <exception scope="next">-</exception>
        <exception regexp="yes">cote|chèvre|chouette|date|attrape|bouche|trompe|pratique|couche|lave|cure|case|pile|toilette|coupe|barde|ride|bienvenue|monte|marque|porte|micro-ondes?|traine|escape|lie|lire|serge|donne|marine|martyre|malaise|(?-i)[A-Z].*|prime|tome</exception></token>
</pattern>
<filter class="org.languagetool.rules.fr.FrenchSuppressMisspelledSuggestionsFilter" args="suppressMatch:true"/>
<message>Le déterminant s'accorde avec le nom "\3".</message>
<suggestion suppress_misspelled="yes">\1 \2 <match no="3" regexp_match="(?iu)e" regexp_replace="é"/></suggestion>
<suggestion suppress_misspelled="yes">\1 \2 <match no="3" regexp_match="(?iu)e" regexp_replace=""/></suggestion>
<suggestion><match no="1" postag="(D|J) .*" postag_regexp="yes" postag_replace="$1 f s"/> \2 \3</suggestion>
<example correction="le carré|la carre">Il faut que <marker>le carre</marker> soit à gauche.</example>
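
D_N_E_OU_E can inflect the determiner because one is already present in the text. When the determiner is missing altogether, one possible workaround (not something confirmed in this thread) would be to split the rule into one subrule per gender and number, each offering only the matching article. A minimal sketch, in which the rulegroup id, the name, the message and the simplified verb token are all hypothetical, and the exceptions from the PR are left out:

```xml
<!-- Hypothetical sketch, not part of this PR: one subrule per gender/number,
     so that each subrule suggests only the article that agrees with the noun.
     The examples are illustrative and have not been run through the test suite. -->
<rulegroup id="VERBE_SUIVI_D_UN_NOM_SKETCH" name="verbe suivi d'un nom (esquisse)">
    <rule>
        <pattern>
            <token postag="V .*" postag_regexp="yes"/>
            <marker>
                <token postag="N m s"/>
            </marker>
        </pattern>
        <message>Il manque peut-être un déterminant.</message>
        <suggestion>un \2</suggestion>
        <example correction="un lapin">J'attrape <marker>lapin</marker>.</example>
    </rule>
    <rule>
        <pattern>
            <token postag="V .*" postag_regexp="yes"/>
            <marker>
                <token postag="N f s"/>
            </marker>
        </pattern>
        <message>Il manque peut-être un déterminant.</message>
        <suggestion>une \2</suggestion>
        <example correction="une pelle">Il prend <marker>pelle</marker>.</example>
    </rule>
    <rule>
        <pattern>
            <token postag="V .*" postag_regexp="yes"/>
            <marker>
                <token postag="N [mfe] p" postag_regexp="yes"/>
            </marker>
        </pattern>
        <message>Il manque peut-être un déterminant.</message>
        <suggestion>des \2</suggestion>
        <example correction="des lapins">J'attrape <marker>lapins</marker>.</example>
    </rule>
</rulegroup>
```

A noun that carries several gender or number readings may match more than one subrule, so this split would not remove every wrong suggestion.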

About CONFUSION_ER_E_PAR2: I have nothing against the rule :) as long as it's at least as good as the current one. Here is the process we could try.

About the actual replacement: when the rule is ready, meaning that

there are no more FPs visible in the diff,
all the FPs AND FNs from the current rule are handled,

we can change the priorities in the file French.java to achieve something close to a "replacement", in the sense of pasting the new rule over the old one.
The new version's priority will have to be higher than the old one's, and below the model's priority. But we will cross that bridge when we come to it.

LucieSteib · 2024-04-09T13:50:43Z

languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml

-                            <exception>godot</exception>
+                            <exception postag="[^N].*" postag_regexp="yes"/>
+                            <exception regexp="yes" case_sensitive="yes">[A-ZÉÈÀÙÂÊÎÔÛÄËÏÖÜÇ].*</exception>
+                            <exception regexp="yes">réparation|confirmation|famille|godot|lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|début|mi-.*|fin</exception>


suggestion(format): maybe you could use ENTITIES here, like mois_annee and jours_semaine (and maybe also unites_temps, parties_journee...). All the ENTITIES are defined at the top of the grammar.xml file.

You can call them in the rule with structures like:
<token regexp="yes">&mois_annee;</token>
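
Applied to the exception added in this PR, that could look like the sketch below. It assumes that jours_semaine and mois_annee expand to plain pipe-separated alternations covering the day and month names listed in the diff, which should be checked against the entity definitions at the top of grammar.xml.

```xml
<!-- Sketch only: the literal day and month names from the diff are replaced
     by the two entities; the other words of the exception are kept as-is.
     Assumes each entity expands to an alternation such as "lundi|mardi|...". -->
<exception regexp="yes">réparation|confirmation|famille|godot|&jours_semaine;|&mois_annee;|début|mi-.*|fin</exception>
```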

Remove words starting with capital letter

ccce937

Sharcoux marked this pull request as draft March 19, 2024 15:22

François Billioud added 4 commits March 20, 2024 10:47

Fix case detection

6a26b41

Handle better the non-nouns tokens

1467eea

Replace example because of disambiguation issues

19e99d0

Fix the example of "RETROUVER" because of disambiguation problem

bef2860

Sharcoux marked this pull request as ready for review March 21, 2024 13:08

LucieSteib reviewed Apr 8, 2024

LucieSteib reviewed Apr 9, 2024

LucieSteib added the French label Apr 10, 2024
