comprehensive french tokenizer without exceptions list #13378

Open
wants to merge 12 commits into
base: master
from

Conversation


@thjbdvlt thjbdvlt commented Mar 15, 2024

The current French tokenizer doesn't handle hyphens and apostrophes very well. It relies on a gigantic list (15600 entries) of hyphenated words that must not be split on the hyphen. This list is not only huge (full of village names such as Minaucourt-le-Mesnil-lès-Hurlus or Beaujeu-Saint-Vallier-Pierrejux-et-Quitteur), but also very incomplete. It has no chance of ever becoming exhaustive, because the number of French common nouns and proper names that contain a hyphen and must not be split by the tokenizer is virtually infinite. In French the hyphen is called trait d'union (literally "union trait"): it unifies, joining separate words into one semantic word (and token). For example, the verb porter (to carry) produces nouns such as porte-clé (a thing we use to carry keys) and porte-manteau, and we can invent any word like this (with porter or any other word). There is also inclusive language (relecteur-rice-s). And of course there are names of people and places, which often contain hyphens, combining existing names or words into new and larger names.

On the other hand, there are cases where a hyphen must split a string into two words. These cases are easily handled with a simple regex because, unlike the infinite exceptions, they are not very diverse: a) verb-subject inversion where the subject is a pronoun; b) verb-object forms where the object is a pronoun. That makes a total of 21 words (suffixes).

This pull request replaces the tokenizer exceptions with a new 're_infixes' function that easily handles each of the 15600 exceptions, and many more. It reverses the rule-exception relation: the rule is to keep a word containing a hyphen as one token; the exception is to split it if the hyphen is followed by one of the registered words (subject or object pronouns).
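
The reversed rule described above can be sketched in plain Python. This is only an illustration of the idea, not the code in this pull request: the `CLITICS` tuple is an assumed, partial list (the 21 suffixes registered in the PR are not enumerated here), `split_on_clitics` is a hypothetical helper name, and keeping the hyphen attached to the pronoun is a choice made for this sketch. A split happens only when everything after the hyphen is a chain of pronouns running to the end of the word, so a place name with an internal "-le-" stays whole.

```python
import re

# Illustrative subset of French pronouns that can follow a verb after a hyphen.
# Assumption: this is NOT the list of 21 suffixes used in the pull request.
CLITICS = (
    "je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles",
    "ce", "moi", "toi", "lui", "leur", "le", "la", "les", "en", "y",
)

# Zero-width match right before a hyphen, but only if the rest of the word
# is made of one or more "-pronoun" pieces up to the end of the string.
SPLIT_POINT = re.compile(
    r"(?=(?:-(?:%s))+$)" % "|".join(CLITICS),
    flags=re.IGNORECASE,
)


def split_on_clitics(word):
    """Split a word on hyphens followed by pronouns; keep anything else whole."""
    # Splitting on an empty match requires Python 3.7 or later.
    return [piece for piece in SPLIT_POINT.split(word) if piece]


for example in ("peux-tu", "donne-le-moi", "porte-clé", "relecteur-rice-s"):
    print(example, "->", split_on_clitics(example))
```

With this sketch, "peux-tu" and "donne-le-moi" are split before each pronoun, while "porte-clé", "relecteur-rice-s" and "Minaucourt-le-Mesnil-lès-Hurlus" remain single tokens. A lexicalized noun that happens to end in a pronoun, such as rendez-vous, would still be split by this simplified version.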

modified:         __init__.py
modified:         punctuation.py
modified:         tokenizer_exceptions.py
@thjbdvlt thjbdvlt changed the base branch from main to master March 15, 2024 14:05
@svlandeg svlandeg added the lang / fr French language data and models label Mar 25, 2024
@svlandeg svlandeg marked this pull request as draft March 25, 2024 12:52

thjbdvlt commented Mar 26, 2024

I apologize for all these failed tests! It's the first time I've contributed to a project (I'm not a programmer: I study French literature), and I only just understood that I could run these tests myself. They no longer fail. Sorry again, and thanks for having a look at my pull request :)

@thjbdvlt thjbdvlt marked this pull request as ready for review March 28, 2024 15:58
Labels
lang / fr French language data and models
2 participants