comprehensive french tokenizer without exceptions list #13378

thjbdvlt · 2024-03-15T11:33:44Z

The current French tokenizer doesn't handle hyphens and apostrophes very well. It relies on a gigantic list (15600 entries) of hyphenated words that must not be split on the hyphen. This list is not only huge (full of village names such as Minaucourt-le-Mesnil-lès-Hurlus or Beaujeu-Saint-Vallier-Pierrejux-et-Quitteur), but also very incomplete. It has no chance of ever becoming exhaustive, because the number of French common nouns and proper names that contain a hyphen and must not be split by the tokenizer is virtually infinite: the hyphen is called trait d'union in French (literally "union mark"); it unifies, it joins separate words into one semantic word (and token). For example, the verb porter (to carry) produces the nouns porte-clé (a thing used to carry keys) and porte-manteau, and any new word can be coined this way (with porter or any other word). There is also inclusive language (relecteur-rice-s). And of course there are names of people and places, which often contain hyphens, combining existing names or words into new and longer names.

On the other hand, there are cases where a hyphen must split a string into two words, and these cases are easily handled with a simple regex because, unlike the endless exceptions, they are not very diverse: a) verb-subject inversion where the subject is a pronoun; b) verb-object forms where the object is a pronoun; for a total of 21 words (suffixes).

This pull request replaces the tokenizer exceptions with a new 're_infixes' function that easily handles each of the 15600 exceptions, and many more. It reverses the rule-exception relation: rule = keep words containing a hyphen as one token; exception = split a word containing a hyphen if the hyphen is followed by one of the registered words (subject/object pronouns).
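To make the reversed rule-exception relation concrete, here is a minimal sketch using only Python's standard `re` module. It is not the PR's actual 're_infixes' code: the names `PRONOUN_SUFFIXES` and `split_hyphenated` are hypothetical, and the pronoun list is an illustrative guess, not the 21 suffixes registered in the PR.

```python
import re
from typing import List

# Hypothetical, partial list of pronoun suffixes (assumption: the PR's
# actual list of 21 suffixes is not reproduced here).
PRONOUN_SUFFIXES = [
    "je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles", "ce",
    "moi", "toi", "le", "la", "les", "lui", "leur", "y", "en",
]

_PRON = "(?:" + "|".join(sorted(PRONOUN_SUFFIXES, key=len, reverse=True)) + ")"

# Zero-width match just before a hyphen that starts a chain of pronoun
# suffixes running to the end of the word. Everything else is left alone.
_SPLIT_RE = re.compile(rf"(?=-{_PRON}(?:-{_PRON})*$)")


def split_hyphenated(word: str) -> List[str]:
    """Rule: keep a hyphenated word whole. Exception: split before a
    hyphen that introduces trailing pronoun suffixes."""
    return _SPLIT_RE.split(word)


print(split_hyphenated("dit-il"))        # ['dit', '-il']
print(split_hyphenated("donne-le-moi"))  # ['donne', '-le', '-moi']
print(split_hyphenated("porte-clé"))     # ['porte-clé']
```

Requiring the pronoun chain to reach the end of the word is a design choice of this sketch: it keeps place names such as Minaucourt-le-Mesnil-lès-Hurlus intact even though they contain "-le". The sketch would also split the noun rendez-vous, so ambiguous forms like that one need handling beyond what is shown here.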

modified: __init__.py
modified: punctuation.py
modified: tokenizer_exceptions.py

thjbdvlt · 2024-03-26T10:32:32Z

i apologize for all these failed tests!! it's the first time i've contributed to a project (i'm not a programmer: i study french literature) and i only just understood that i could run these tests myself: now they don't fail anymore. sorry again, and thanks for having a look at my pull request :)

thjbdvlt added 4 commits March 15, 2024 08:45

refactored french tokenizer

5a3928f

works

ef4c655

modified: __init__.py
modified: punctuation.py
modified: tokenizer_exceptions.py

end french tokenizer + add sentences (examples)

8e8dd31

removed few new sentences

73b68c4

thjbdvlt changed the base branch from main to master March 15, 2024 14:05

svlandeg added the lang / fr French language data and models label Mar 25, 2024

svlandeg marked this pull request as draft March 25, 2024 12:52

thjbdvlt added 7 commits March 25, 2024 14:34

formatted with black

3962669

isort + flake8 fixed

f3967e8

fixed variable types

6578492

missing suffix: 'ce'

43aec2d

add months + days abbrev

b0b66fb

add dot after some abbrevs

54fcc5b

add abbrev juill. (same as juil.)

0f1a82b

add elleux

aee667a

thjbdvlt marked this pull request as ready for review March 28, 2024 15:58
