Code from the paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation" (LREC-Coling 2024)

macairececile/picto_grammar

A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation

This repository contains the scripts to run the formalism described in the LREC-Coling 2024 paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation".

⭐ Do not hesitate to test it and to report any bugs, feedback, or results.

🔍 Overview

The goal of PictoGrammar is to translate a speech transcription (or any text) into a sequence of Arasaac pictograms.

PictoGrammar relies on three components: (1) spaCy to tokenize, lemmatize, and POS-tag the input, (2) a Named Entity Recognition (NER) model based on CamemBERT, and (3) a Word Sense Disambiguation (WSD) model.

⚙️ Requirements and Installation

git clone https://github.com/macairececile/picto_grammar.git
cd picto_grammar/
pip install -r requirements.txt
git clone https://github.com/macairececile/nwsd.git
export PYTHONPATH=$PYTHONPATH:/path_to_nwsd/nwsd/src

Then download the WSD model from https://cloud.univ-grenoble-alpes.fr/s/XECiw4gmEbGDprD and decompress it into the picto_grammar/data/ folder.

📉 Running PictoGrammar

The repository is organized into 4 folders:

  • src/ -- Python scripts to run the grammar.
  • img/ -- images.
  • data/ -- folder with the data used in the paper.
  • examples/ -- folder with examples of output files generated by the grammar.

Data format

  • Input data format: a .csv file with two tab-separated columns, clips and text (see example file in examples/input.csv)
clips	text
cefc-tcof-Acc_del_07-118	mh il y a pas longtemps j'ai revu une tante
cefc-tcof-Acc_del_07-112	oh ben ouais euh enfin c'est je sais
cefc-tcof-Acc_del_07-166	tu dis euh un pneu de voiture
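As an illustration of the input format above, here is a minimal sketch that reads such a tab-separated file with the Python standard library. The helper name read_utterances is hypothetical and is not part of this repository; the sketch only assumes the clips and text header shown in the example.

```python
import csv
import io


def read_utterances(handle):
    """Yield (clip_id, text) pairs from a tab-separated file.

    Hypothetical helper, not part of picto_grammar. It assumes the
    header row contains the 'clips' and 'text' columns shown above.
    """
    # QUOTE_NONE keeps apostrophes and quotes in transcripts untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["clips"], row["text"]


# In-memory sample mirroring two rows of the example input.
sample = (
    "clips\ttext\n"
    "cefc-tcof-Acc_del_07-118\tmh il y a pas longtemps j'ai revu une tante\n"
    "cefc-tcof-Acc_del_07-166\ttu dis euh un pneu de voiture\n"
)
utterances = list(read_utterances(io.StringIO(sample)))
print(utterances)
```

To read a real file instead of the in-memory sample, pass a handle opened with open("examples/input.csv", encoding="utf-8", newline="").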
  • Output data format:

A .csv file with 5 tab-separated columns, clips, text, text_process, pictos, and tokens (see example file in examples/output.csv):

clips	text	text_process	pictos	tokens
cefc-tcof-Acc_del_07-118	mh il y a pas longtemps j'ai revu une tante	mh il y a pas longtemps j'ai revu une tante	[9839, 9001, 5526, 37678, 6632, 37163, 6564, 8474, 30276]	passé il_y_a non longtemps me une_autre_fois voir une tante
cefc-tcof-Acc_del_07-112	oh ben ouais euh enfin c'est je sais	oh ben oui euh enfin c'est je sais	[5584, 7095, 36480, 6632, 16885]	oui celui-là être me savoir
cefc-tcof-Acc_del_07-166	tu dis euh un pneu de voiture	tu dis euh un pneu de voiture	[6625, 9693, 2627, 37072, 7074, 2339]	toi dire un pneu de voiture

An .html file to visualize the generated pictogram sequence for each utterance (see example file in examples/out.html).
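To make the relation between the pictos and tokens columns concrete, the following sketch pairs each pictogram ID with its token for one output row. The function align_pictos is hypothetical. It assumes that pictos is written as a Python-style list of integers and that tokens are space-separated with one token per ID, which holds for the three example rows above but is not guaranteed by the source for every utterance.

```python
import ast


def align_pictos(pictos_field, tokens_field):
    """Pair each pictogram ID with its token for one output row.

    Hypothetical helper, not part of picto_grammar. Assumes a
    one-to-one, space-separated alignment between IDs and tokens.
    """
    # "[6625, 9693]" -> [6625, 9693], without evaluating arbitrary code.
    picto_ids = ast.literal_eval(pictos_field)
    tokens = tokens_field.split()
    if len(picto_ids) != len(tokens):
        raise ValueError(
            f"{len(picto_ids)} pictogram IDs but {len(tokens)} tokens"
        )
    return list(zip(picto_ids, tokens))


# Values copied from the last example row of the output file.
pairs = align_pictos(
    "[6625, 9693, 2627, 37072, 7074, 2339]",
    "toi dire un pneu de voiture",
)
for picto_id, token in pairs:
    print(picto_id, token)
```

The length check is a deliberate choice: a mismatch usually means the row was split on the wrong delimiter, so failing early is safer than silently truncating with zip.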

Use the grammar

python src/grammar.py --wn_file "data/dico/index.sense" --no_transl "data/dico/no_translation.csv" --wsd "data/wsd_model/" --lexicon "data/dico/lexique.csv" --data "examples/input.csv" --out "examples/out.csv" --tags "data/dico/tags.csv"

An out.html file is also generated to visualize the output pictogram sequences.
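When the grammar has to be run on several input files, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags and paths shown in the command; the helper build_grammar_command and the PATHS dictionary are hypothetical names, and the actual call is left commented out because it requires the repository, its dependencies, and the WSD model to be installed.

```python
import shlex
import sys


def build_grammar_command(paths, script="src/grammar.py"):
    """Return the argument list for one run of the grammar script.

    Hypothetical helper: it maps each key of `paths` to the
    "--key value" pair used in the command shown above.
    """
    command = [sys.executable, script]
    for flag, value in paths.items():
        command += [f"--{flag}", value]
    return command


# Flags and paths copied from the example command.
PATHS = {
    "wn_file": "data/dico/index.sense",
    "no_transl": "data/dico/no_translation.csv",
    "wsd": "data/wsd_model/",
    "lexicon": "data/dico/lexique.csv",
    "data": "examples/input.csv",
    "out": "examples/out.csv",
    "tags": "data/dico/tags.csv",
}

command = build_grammar_command(PATHS)
print(shlex.join(command))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a path contains spaces.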

📝 Citation

@inproceedings{macaire_lrec2024,
  title = {A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Arrigo, Jordan and Lemaire, Claire and Esperan{\c c}a-Rodier, Emmanuelle and Lecouteux, Benjamin and Schwab, Didier},
  url = {https://hal.science/hal-04534234},
  booktitle = {LREC-Coling},
  address = {Turin, Italy},
  year = {2024},
  month = May,
  keywords = {Pictograms ; Speech ; Machine Translation},
  pdf = {https://hal.science/hal-04534234/file/1210_Paper_LREC_Coling_Macaire.pdf},
  hal_id = {hal-04534234}
}
