Skip to content

pdbang/doccano-camembert-transformer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Named Entity Recognition with Doccano and camembert 🧨

Introduction 💊

Doccano is an excellent text labeling tool for named entity recognition, but the library that processes the output of this software is not very flexible and is not updated anymore.

This library has been developed in order to make it possible to use data from Doccano with Camembert using pandas and its dataframes. Some simple modifications should allow to use other models. You just have to modify the tokenizer and the model in doccano_to_camembert.py and the character management in utils.

An example is available on the notebook ner_model.ipynb

Tokenization ✂

We used the platform Doccano for manual text labellisation, and CamemBERT for tokenization, which is a state-of-the-art language model for French based on the RoBERTa model. CamemBERT can only deal with numerical class values, so we had to adapt the initial output of Doccano.

The scoring of the final model was made with Kaggle, so we had to convert our output to a specific submission model.

The functions for tokenization and labels management are stored in scripts

Running the NER CamemBERT-based model 🧀

The final NER model is located at ner_model.ipynb.

About

Adaptation du github https://github.com/doccano/doccano-transformer pour utiliser les textes labellisés sur Doccano avec camembert.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published