LEM17

Linguistically annotated corpora of modern French (16-18th c.) with Pie models

«Sisyphe portant CornMol» (Titian, Prado Museum, Madrid, Spain, Source: Wikipedia).

Data

We provide:

Several authority lists, two deriving from LGeRM.

One list contains only proper nouns (proper), with the latest additions at the end
One list contains all the other lemmas (authority), with the latest additions at the end
One list contains all the foreign words (foreign), with the latest additions at the end
Each file has a _processed version in which all entries are sorted alphabetically and duplicate entries have been removed
On top of these three files, numbers contains Roman and Arabic numerals and alphabet contains single Latin letters.
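
The step from a raw list to its _processed version can be sketched as follows. This is a minimal illustration, not the script used to build the published files: it assumes one entry per line, the function name `process_entries` and the sample entries are hypothetical, and sorting by plain Unicode code point is an assumption (the published files may follow a locale-aware order).

```python
def process_entries(lines):
    """Return the entries sorted alphabetically with duplicates removed.

    Surrounding whitespace is stripped and blank lines are ignored.
    """
    # A set keeps a single copy of each entry.
    entries = {line.strip() for line in lines if line.strip()}
    # sorted() orders by Unicode code point (an assumption, see above).
    return sorted(entries)


# Hypothetical sample: new lemmas were appended at the end of the raw list,
# and "aimer" was added a second time by mistake.
raw = ["aimer", "roi", "aimer", "cheval", ""]
processed = process_entries(raw)
print(processed)
```

To apply this to a file, read its lines with `open(path, encoding="utf-8")` and write the result back with one entry per line.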

Training data:

CornMol is a gold corpus to be published
FranText is a corpus taken from the open data of FranText and aligned with our lemmatisation standards.
presto_gold is a gold corpus used by the Presto project to train their TreeTagger model, converted to CATTEX and lightly corrected to match our authority lists.
presto_max contains all the modern (16th-18th c.) texts of the Presto project, with heavily corrected lemmas. Each round of annotation/correction is numbered (v2, v3…)

Out-of-domain testing data for 16th, 17th, 18th, 19th and 20th c. French:

The data are split into theatrical and non-theatrical texts for historical reasons.
The same data exist in two versions, normalised and original; the 19th and 20th c. texts are identical in both, and only the 16th, 17th and 18th c. texts differ.

The Models folder contains all the models produced with our data.

|-Authority_list
  |-authority_processed
  |-authority
  |-propres_processed
  |-propres
  |-foreign
|-Data
  |-CornMol_gold
  |-FranText
  |-presto_max
  |-presto_gold
|-Data_outOfDomain
  |-Data_outOfDomain_normalised
    |-theatre_normalised
    |-varia_normalised
  |-Data_outOfDomain_original
    |-theatre_original
    |-varia_original
|-Models
  |-train_1
  |-train_2
    |-Models
      |-lemma.tar
      |-pos.tar

Use the lemmatiser

To use the model:

Create a virtual environment (virtualenv env) and activate it (source env/bin/activate)
Install Pie-extended: pip install pie-extended
Download the freem model: pie-extended download freem
Use the freem model: pie-extended tag freem your_file.txt

Do note that pie-extended includes a tokeniser dedicated to (early-)modern French.
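
To tag several files in one go, the tagging command above can be wrapped in a short script. This is a sketch under assumptions: it presumes pie-extended is installed in the active environment and the freem model is already downloaded, and the helper names `tag_command` and `tag_files` are hypothetical. It relies only on the `pie-extended tag freem your_file.txt` invocation given in the steps above.

```python
import shutil
import subprocess


def tag_command(path, model="freem"):
    """Build the tagging command from the steps above for a single file."""
    return ["pie-extended", "tag", model, str(path)]


def tag_files(paths, model="freem"):
    """Run pie-extended on each file in turn, stopping at the first failure."""
    # Fail early with a readable message if the CLI is not on the PATH.
    if shutil.which("pie-extended") is None:
        raise RuntimeError(
            "pie-extended not found: install it and activate the environment"
        )
    for path in paths:
        # check=True raises CalledProcessError if the tagger exits with an error.
        subprocess.run(tag_command(path, model), check=True)


# Show the command that would be run for one file, without running it.
print(" ".join(tag_command("your_file.txt")))
```

Calling `tag_files(["your_file.txt", "another_file.txt"])` (hypothetical file names) then tags each file with the freem model.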

Warnings

The morphology is provided but has not been carefully proofread.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Presto and LGeRM data are licensed under a Creative Commons Attribution 4.0 International Licence.

Contribute

If you want to contribute, you can do so by cloning the repository and sending us a pull request, or by sending an email to simon.gabay[at]unige.ch.

Cite this repository

Simon Gabay, Thibault Clérice, Matthias Gille-Levenson, Jean-Baptiste Camps, Jean-Baptiste Tanguy, LEM17: data and models for modern French (16-18th c.), Neuchâtel: Université de Neuchâtel, 2020, https://github.com/e-ditiones/LEM17.

Please keep me posted if you use this data! simon.gabay[at]unige.ch

Contact

simon.gabay[at]unige.ch
