LEM17

Linguistically annotated corpora of modern French (16-18th c.) with Pie models


«Sisyphe portant CornMol» (Titian, Prado Museum, Madrid, Spain, Source: Wikipedia).

Data

We provide:

  1. Several authority lists, two of which derive from LGeRM.
  • One list contains only proper nouns (proper), with the latest additions at the end.
  • One list contains all the other lemmas (authority), with the latest additions at the end.
  • One list contains all the foreign words (foreign), with the latest additions at the end.
  • Each file has a _processed version with all entries in alphabetical order, checked so that no entry appears twice.
  • In addition to these three files, numbers contains Latin and Arabic numerals and alphabet contains single Latin letters.
  2. Training data:
  • CornMol is a gold corpus, to be published.
  • FranText is a corpus taken from the open data of FranText and aligned with our lemmatisation standards.
  • presto_gold is the gold corpus used by the Presto project to train their TreeTagger model, converted to CATTEX and lightly corrected to match our authority lists.
  • presto_max contains all the modern (16th-18th c.) texts of the Presto project, with lemmas heavily corrected. Each round of annotation/correction is numbered (v2, v3…).
  3. Out-of-domain testing data for 16th, 17th, 18th, 19th and 20th c. French
  • For historical reasons, the data are split into theatrical and non-theatrical texts.
  • The same data exist in two versions, normalised and original (the 19th and 20th c. texts are identical in both; only the 16th, 17th and 18th c. texts differ).
  4. The Models folder contains all the models produced with our data.
|-Authority_list
  |-authority_processed
  |-authority
  |-propres_processed
  |-propres
  |-foreign
|-Data
  |-CornMol_gold
  |-FranText
  |-presto_max
  |-presto_gold
|-Data_outOfDomain
  |-Data_outOfDomain_normalised
    |-theatre_normalised
    |-varia_normalised
  |-Data_outOfDomain_original
    |-theatre_original
    |-varia_original
|-Models
  |-train_1
  |-train_2
    |-Models
      |-lemma.tar
      |-pos.tar
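
The _processed versions of the authority lists (alphabetical order, no repeated entry) could be produced along the following lines. This is a minimal sketch and not the project's own script: it assumes one entry per line in a UTF-8 text file, the function names `process_entries` and `process_file` are hypothetical, and the case-insensitive ordering used here may differ from the collation actually applied in the repository.

```python
from pathlib import Path


def process_entries(entries):
    """Return the entries without duplicates, in alphabetical order."""
    # Strip surrounding whitespace and ignore empty lines.
    cleaned = (entry.strip() for entry in entries)
    unique = {entry for entry in cleaned if entry}
    # Assumption: case-insensitive order, with the raw string as a tiebreak
    # so that the result is deterministic.
    return sorted(unique, key=lambda entry: (entry.casefold(), entry))


def process_file(source, target):
    """Read a list (one entry per line) and write its processed version."""
    lines = Path(source).read_text(encoding="utf-8").splitlines()
    processed = process_entries(lines)
    Path(target).write_text("\n".join(processed) + "\n", encoding="utf-8")
```

For instance, `process_file("authority", "authority_processed")` would read the first file and write the second, assuming both paths are given relative to the Authority_list folder.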

Use the lemmatiser

To use the model:

  1. Create a virtual environment (virtualenv env) and activate it (source env/bin/activate)
  2. Install Pie-extended: pip install pie-extended
  3. Download the freem model: pie-extended download freem
  4. Use the freem model: pie-extended tag freem your_file.txt

Do note that pie-extended includes a tokeniser dedicated to (early-)modern French.
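
Once a file has been tagged, the annotations can be loaded for further processing. The sketch below rests on an assumption that should be checked against your own output: that the tagger writes a tab-separated file whose first row names the columns. The column name `lemma` and the helper names `read_annotations` and `count_values` are hypothetical.

```python
import csv
import io
from collections import Counter


def read_annotations(tsv_text):
    """Parse tab-separated annotations into a list of dicts (one per token)."""
    # QUOTE_NONE keeps quotation marks found in the text as ordinary characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def count_values(rows, column="lemma"):
    """Count how often each value occurs in one column (assumed name: lemma)."""
    return Counter(row[column] for row in rows)
```

The text passed to `read_annotations` would be the content of the tagged file, read with `encoding="utf-8"`. Adjust `column` to whatever header the file actually contains.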

Warnings

The morphology is provided but has not been carefully proofread.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Presto and LGeRM data are licensed under a Creative Commons Attribution 4.0 International Licence.

Contribute

If you want to contribute, you can clone the repository and send us a pull request, or send an email to simon.gabay[at]unige.ch.

Cite this repository

Simon Gabay, Thibault Clérice, Matthias Gille-Levenson, Jean-Baptiste Camps, Jean-Baptiste Tanguy, LEM17: data and models for modern French (16-18th c.), Neuchâtel: Université de Neuchâtel, 2020, https://github.com/e-ditiones/LEM17.

Please keep me posted if you use this data! simon.gabay[at]unige.ch

Contact

simon.gabay[at]unige.ch
