Language Models

Repository of pre-trained Language Models and NLP models.

unstructured library | Get the JSON and HTML versions of any PDF (legal, financial, medical…), even PDF with tables!

Blog post: unstructured library | Get the JSON and HTML versions of any PDF (legal, financial, medical…), even PDF with tables!
Notebook: Unstructured_PDF_to_JSON_and_HTML.ipynb

Speech-to-Text | Get transcription WITH SPEAKERS from large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)

Blog post: Speech-to-Text | Get transcription WITH SPEAKERS from large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)
Notebook: speech_to_text_transcription_with_speakers_Whisper_Transcription_+_NeMo_Diarization.ipynb
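
Combining a transcription with a speaker diarization comes down to matching two lists of time-stamped segments. The following is a minimal sketch of that idea, not the notebook's actual code: the `Segment` class, the function names, and the sample timestamps are all hypothetical, and the rule used (pick the speaker turn with the largest time overlap) is only one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text, or speaker name for a diarization turn


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time interval shared by a and b (0 if none)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(transcript, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    result = []
    for seg in transcript:
        best = max(turns, key=lambda turn: overlap(seg, turn), default=None)
        if best is not None and overlap(seg, best) > 0:
            speaker = best.label
        else:
            speaker = "unknown"  # no diarization turn covers this segment
        result.append((speaker, seg.label))
    return result


# Hypothetical data, for illustration only.
transcript = [
    Segment(0.0, 2.5, "Hello, how are you?"),
    Segment(2.6, 4.0, "Fine, thanks."),
]
turns = [
    Segment(0.0, 2.55, "SPEAKER_00"),
    Segment(2.55, 4.2, "SPEAKER_01"),
]
labeled = assign_speakers(transcript, turns)
```

In a real pipeline the transcript segments would come from the speech-to-text model and the turns from the diarization model; only the matching step is shown here.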

Video-to-Audio | A notebook and Web APP to get mp3 audio file from any YouTube video

Blog post: Video-to-Audio | A notebook and Web APP to get mp3 audio file from any YouTube video
Notebook: youtube_video_to_audio.ipynb
Web APP: Free YouTube URL Video-to-Audio

Speech-to-Text | Quickly get a transcription of a large audio file in any language with "Faster-Whisper"

Course | ChatGPT Prompt Engineering for Developers

Blog post: Generative AI | How to control ChatGPT to write a text that meets your expectations (course by DeepLearning.AI and OpenAI)

Document AI | Accuracy of layout models (LiLT and LayoutXLM base) fine-tuned on the DocLayNet base dataset (notebooks)

Document AI | Inference at paragraph level by using the association of 2 Document Understanding models (LiLT and LayoutXLM base fine-tuned on DocLayNet base dataset)

Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at paragraph level

Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level with LayoutXLM base

Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at line level

Document AI | Inference APP and fine-tuning notebook for Document Understanding at line level with LayoutXLM base

Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level

Document AI | Inference APP for Document Understanding at line level

Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset

Blog Post: Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset
Notebook (update on 02/14/2023): Document AI | Inference at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)
Notebook: Document AI | Fine-tune LiLT on DocLayNet base in any language at line level (chunk of 384 tokens with overlap)
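
The "chunk of 384 tokens with overlap" mentioned above can be sketched as a sliding window over the token ids of a page. This is an illustrative sketch only: the function name is hypothetical and the overlap of 128 tokens is an assumption, since the size of the overlap is not stated here.

```python
def chunk_with_overlap(token_ids, max_len=384, overlap=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so the stride is max_len - overlap.
    The default overlap of 128 is an assumption made for this sketch.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the sequence
    return chunks


tokens = list(range(1000))  # stand-in for the token ids of one page
chunks = chunk_with_overlap(tokens)
chunk_lengths = [len(c) for c in chunks]
```

With Hugging Face tokenizers, a similar windowing can be requested by passing `max_length`, `stride`, `truncation=True` and `return_overflowing_tokens=True` when encoding, which avoids writing the loop by hand.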

DocLayNet image viewer APP

Blog post: Document AI | DocLayNet image viewer APP
Notebook DocLayNet image viewer APP

Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)

Speech-to-Text & AI | Transcription of any audio in Portuguese with Whisper

Blog post: Speech-to-Text & AI | Transcribe any audio into Portuguese with Whisper (OpenAI)... at no cost!
Notebook Whisper in Portuguese
Notebook Whisper in French
Notebook Inference code for Whisper (example with Whisper Medium in Portuguese)

AI & business | Reduce the inference time of Transformer models with BetterTransformer

NLP & Code for everyone | Weighted loss function for (multiclass) text classification
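
A weighted loss for multiclass classification usually means giving rare classes a larger weight in the cross-entropy. The sketch below illustrates the idea in plain Python and is not taken from the notebook: the function names, the toy labels, and the choice of weights (`n_samples / (num_classes * class_count)`) are assumptions made for the example.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of -log p(target), normalized by the sum of the target weights."""
    num = 0.0
    den = 0.0
    for logits, y in zip(batch_logits, targets):
        p = softmax(logits)[y]
        num += weights[y] * -math.log(p)
        den += weights[y]
    return num / den


labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]  # hypothetical imbalanced training labels
weights = balanced_class_weights(labels, num_classes=3)
loss = weighted_cross_entropy([[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]], [0, 2], weights)
```

The normalization by the sum of the weights mirrors what PyTorch's `CrossEntropyLoss` does when a `weight` tensor is given with the default mean reduction, so in practice the weights computed this way can be handed to that loss directly.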

NLP in business | How I trained a Portuguese T5 model on the QA task in Google Colab

Blog post: NLP in business | How I trained a Portuguese T5 model on the QA task in Google Colab
Notebook: Finetuning of the language model T5 base on a Question-Answering task (QA) with the dataset SQuAD 1.1 Portuguese
QA App in the Hugging Face Spaces

NLP | Models and Web App for Named Entity Recognition (NER) in the Brazilian legal domain

Blog post: NLP | Models and Web App for Named Entity Recognition (NER) in the Brazilian legal domain
NER App in the Hugging Face Spaces

Finetuning of the specialized version of the language model BERTimbau on a token classification task (NER) with the dataset LeNER-Br

notebook: HuggingFace_Notebook_token_classification_NER_LeNER_Br.ipynb (nbviewer of the notebook)
BERT base NER model in the legal domain in Portuguese (LeNER-Br) in the Hugging Face model hub
BERT large NER model in the legal domain in Portuguese (LeNER-Br) in the Hugging Face model hub

Finetuning of the language model BERTimbau on LeNER-Br text files

notebook: Finetuning_language_model_BERtimbau_LeNER_Br.ipynb (nbviewer of the notebook)
dataset: pierreguillou/lener_br_finetuning_language_model
BERT base Language modeling in the legal domain in Portuguese (LeNER-Br) in the Hugging Face model hub
BERT large Language modeling in the legal domain in Portuguese (LeNER-Br) in the Hugging Face model hub

NLP in business | Techniques to speed up Deep Learning models for inference in production

NLP in business | Text recognition with Deep Learning in PDFs and images

NLP in business | How to create a BERT Question-Answering (QA) model with improved performance using AdapterFusion?

notebook question_answering_adapter_fusion.ipynb (nbviewer of the notebook): finetuning an MLM (Masked Language Model) like BERT (base or large) with the library adapter-transformers on the Question Answering task (QA) with AdapterFusion
Blog post: NLP in business | How to create a BERT Question-Answering (QA) model with improved performance using AdapterFusion?

NLP in business | How to fine-tune a natural language model like BERT for the Question-Answering (QA) task with an Adapter?

notebooks question_answering_adapter.ipynb (nbviewer of the notebook) and question_answering_adapter_script.ipynb (nbviewer of the notebook): finetuning an MLM (Masked Language Model) like BERT (base or large) with the library adapter-transformers on the Question Answering task (QA)
Blog post: NLP in business | How to fine-tune a natural language model like BERT for the Question-Answering (QA) task with an Adapter?

NLP in business | How to fine-tune a natural language model like BERT for the token classification (NER) task with an Adapter?

notebook token_classification_adapter.ipynb (nbviewer of the notebook): finetuning an MLM (Masked Language Model) like BERT (base or large) with the library adapter-transformers on the Token Classification task (NER)
Blog post: NLP in business | How to fine-tune a natural language model like BERT for the token classification (NER) task with an Adapter?

NLP in business | How to adapt a natural language model like BERT to a new linguistic domain with an Adapter?

notebook language_modeling_adapter.ipynb (nbviewer of the notebook): finetuning an MLM (Masked Language Model) like BERT (base or large) with the library adapter-transformers
Blog post: NLP in business | How to adapt a natural language model like BERT to a new linguistic domain with an Adapter?

NLP | Question Answering model in any language based on BERT large (case study in Portuguese)

notebook question_answering_BERT_large_cased_squad_v11_pt.ipynb (nbviewer of the notebook): training code of a Portuguese BERT large cased QA (Question Answering) model, finetuned on SQuAD v1.1
Blog post: NLP | How to train a Question Answering model in any language based on BERT large, improving the model's performance by using BERT base? (case study in Portuguese)
Model in the Model Hub of Hugging Face: Portuguese BERT large cased QA (Question Answering), finetuned on SQUAD v1.1

NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece

Summary: In some cases, it may be crucial to enrich the vocabulary of an already trained natural language model with vocabulary from a specialized domain (medicine, law, etc.) in order to perform new tasks (classification, NER, summary, translation, etc.). While the Hugging Face library allows you to easily add new tokens to the vocabulary of an existing tokenizer like BERT WordPiece, those tokens must be whole words, not subwords. This article explains why and how to obtain these new tokens from a specialized corpus.
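
As a rough illustration of obtaining whole-word candidates from a specialized corpus, the sketch below counts the words of a corpus and keeps the frequent ones that are missing from an existing vocabulary. It is not the article's method: the function name, the regular expression used to split words, the lowercasing, the thresholds, and the toy data are all assumptions.

```python
import re
from collections import Counter


def candidate_new_tokens(corpus, existing_vocab, min_count=2, top_k=100):
    """Return frequent whole words of the corpus that are absent from the vocabulary."""
    # Runs of letters only; lowercasing is an assumption suited to an uncased vocabulary.
    words = re.findall(r"[^\W\d_]+", corpus.lower())
    counts = Counter(w for w in words if w not in existing_vocab)
    return [w for w, c in counts.most_common(top_k) if c >= min_count]


# Hypothetical toy data; a real run would use the tokenizer's vocabulary and a large corpus.
existing_vocab = {"the", "patient", "was", "treated", "with", "and", "for"}
corpus = (
    "The patient was treated with amoxicillin. "
    "Amoxicillin and ibuprofen were prescribed for the patient."
)
new_tokens = candidate_new_tokens(corpus, existing_vocab)
```

With the Hugging Face transformers library, such a list would then typically be passed to `tokenizer.add_tokens(new_tokens)`, followed by `model.resize_token_embeddings(len(tokenizer))` so that the embedding matrix matches the enlarged vocabulary.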

NLP | Question Answering model in any language based on BERT base (case study in Portuguese)

notebook colab_question_answering_BERT_base_cased_squad_v11_pt.ipynb (nbviewer of the notebook): training code of a Portuguese BERT base cased QA (Question Answering) model, finetuned on SQuAD v1.1
Blog post: NLP | Question Answering model in any language based on BERT base (case study in Portuguese)
Model in the Model Hub of Hugging Face: Portuguese BERT base cased QA (Question Answering), finetuned on SQUAD v1.1

Portuguese

I trained 1 Portuguese Bidirectional Language Model (PBLM) with the MultiFiT configuration on 1 NVIDIA V100 GPU on GCP.

WARNING: a Bidirectional LM using the MultiFiT configuration is a good model for text classification, but with only 46 million parameters it is far from being an LM that can compete with GPT-2 or BERT on NLP tasks like text generation. This is my next step ;-)

Note: The training times shown in the tables on this page are the sum of the creation time of the fastai DataBunch (forward and backward) and the training duration of the bidirectional model over 10 epochs. The download time of the Wikipedia corpus and its preparation time are not counted.

MultiFiT configuration (architecture: 4 QRNN layers with 1550 hidden units per layer / tokenizer: SentencePiece (15 000 tokens))

notebook lm3-portuguese.ipynb (nbviewer of the notebook): code used to train a Portuguese Bidirectional LM on a 100-million-token corpus extracted from Wikipedia, using the MultiFiT configuration.
link to download the pre-trained parameters and vocabulary in the models folder

PBLM	accuracy	perplexity	training time
forward	39.68%	21.76	8h
backward	43.67%	22.16	8h
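
To relate the perplexity column to a training loss, recall that perplexity is the exponential of the mean per-token cross-entropy. The short sketch below converts the table values back to losses; it assumes the reported perplexities were computed this way (with natural logarithms), which is not stated on this page, and the function names are hypothetical.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    """Inverse conversion: the cross-entropy that yields the given perplexity."""
    return math.log(perplexity)


forward_loss = loss_from_perplexity(21.76)   # forward PBLM, value from the table
backward_loss = loss_from_perplexity(22.16)  # backward PBLM, value from the table
```

Under that assumption, both models have a validation loss slightly above 3 nats per token, and the small gap in perplexity corresponds to an even smaller gap in loss.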

Applications:
- notebook lm3-portuguese-classifier-TCU-jurisprudencia.ipynb (nbviewer of the notebook): code used to fine-tune a Portuguese Bidirectional LM and a Text Classifier on "(reduzido) TCU jurisprudência" dataset.
- notebook lm3-portuguese-classifier-olist.ipynb (nbviewer of the notebook): code used to fine-tune a Portuguese Bidirectional LM and a Sentiment Classifier on "Brazilian E-Commerce Public Dataset by Olist" dataset.

[ WARNING ] The code of the notebook lm3-portuguese-classifier-olist.ipynb must be updated to use the SentencePiece model and vocab already trained for the Portuguese Language Model in the notebook lm3-portuguese.ipynb, as was done in the notebook lm3-portuguese-classifier-TCU-jurisprudencia.ipynb (see the explanations at the top of this notebook).

[Image: example of using the classifier to predict the category of a TCU legal text]

French

I trained 3 French Bidirectional Language Models (FBLM) on 1 NVIDIA V100 GPU on GCP; the best is the one trained with the MultiFiT configuration.

French Bidirectional Language Models (FBLM)	direction	accuracy	perplexity	training time
MultiFiT with 4 QRNN + SentencePiece (15 000 tokens)	forward	43.77%	16.09	8h40
MultiFiT with 4 QRNN + SentencePiece (15 000 tokens)	backward	49.29%	16.58	8h10
ULMFiT with 3 QRNN + SentencePiece (15 000 tokens)	forward	40.99%	19.96	5h30
ULMFiT with 3 QRNN + SentencePiece (15 000 tokens)	backward	47.19%	19.47	5h30
ULMFiT with 3 AWD-LSTM + spaCy (60 000 tokens)	forward	36.44%	25.62	11h
ULMFiT with 3 AWD-LSTM + spaCy (60 000 tokens)	backward	42.65%	27.09	11h

1. MultiFiT configuration (architecture: 4 QRNN layers with 1550 hidden units per layer / tokenizer: SentencePiece (15 000 tokens))

notebook lm3-french.ipynb (nbviewer of the notebook): code used to train a French Bidirectional LM on a 100-million-token corpus extracted from Wikipedia, using the MultiFiT configuration.
link to download the pre-trained parameters and vocabulary in the models folder

FBLM	accuracy	perplexity	training time
forward	43.77%	16.09	8h40
backward	49.29%	16.58	8h10

Application: notebook lm3-french-classifier-amazon.ipynb (nbviewer of the notebook): code used to fine-tune a French Bidirectional LM and a Sentiment Classifier on "French Amazon Customer Reviews" dataset.

[Image: example of using the classifier to predict the sentiment of reviews of an Amazon product]

2. Architecture QRNN / tokenizer SentencePiece

notebook lm2-french.ipynb (nbviewer of the notebook): code used to train a French Bidirectional LM on a 100-million-token corpus extracted from Wikipedia
link to download the pre-trained parameters and vocabulary in the models folder

FBLM	accuracy	perplexity	training time
forward	40.99%	19.96	5h30
backward	47.19%	19.47	5h30

Application: notebook lm2-french-classifier-amazon.ipynb (nbviewer of the notebook): code used to fine-tune a French Bidirectional LM and a Sentiment Classifier on "French Amazon Customer Reviews" dataset.

3. Architecture AWD-LSTM / tokenizer spaCy

notebook lm-french.ipynb (nbviewer of the notebook): code used to train a French Bidirectional LM on a 100-million-token corpus extracted from Wikipedia
link to download the pre-trained parameters and vocabulary in the models folder

FBLM	accuracy	perplexity	training time
forward	36.44%	25.62	11h
backward	42.65%	27.09	11h

Application: notebook lm-french-classifier-amazon.ipynb (nbviewer of the notebook): code used to fine-tune a French Bidirectional LM and a Sentiment Classifier on "French Amazon Customer Reviews" dataset.


piegu/language-models
