
ALBERT-Mongolian

HuggingFace demo

This repo provides a pretrained ALBERT model ("A Lite" version of BERT) and a SentencePiece model (an unsupervised text tokenizer and detokenizer), both trained on a Mongolian text corpus.

Usage

You can use ALBERT-Mongolian in both PyTorch and TensorFlow 2.0 via the transformers library.

link to HuggingFace model card 🤗

import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

# Load the SentencePiece-based tokenizer and the masked-LM checkpoint from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
model = AlbertForMaskedLM.from_pretrained('bayartsogt/albert-mongolian')
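
As a quick sanity check, here is a minimal fill-mask sketch; the example sentence and the top-5 choice are illustrative and not taken from this repo:

# Predict the most likely tokens for a masked position in a Mongolian sentence
text = f"Улаанбаатар бол Монгол улсын {tokenizer.mask_token} юм."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the 5 highest-scoring subword tokens
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))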

Tutorials

AWS-Mongolians e-meetup #3

Results

| Model       | Problem             | Task          | weighted F1 |
|-------------|---------------------|---------------|-------------|
| ALBERT-base | Text Classification | Eduge dataset | 0.90        |
| ...         | ...                 | ...           | ...         |
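
For context, here is a minimal sketch of how the checkpoint could be loaded for such a text-classification task; the label count, example sentence, and the lack of a training loop are assumptions, not the repo's exact fine-tuning setup:

import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
# num_labels=9 assumes the nine Eduge news categories; the classification head
# is randomly initialized and must be fine-tuned before predictions are meaningful
model = AlbertForSequenceClassification.from_pretrained(
    'bayartsogt/albert-mongolian', num_labels=9)

texts = ["Монголын баг тэмцээнд түрүүллээ."]  # illustrative input, not from the dataset
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    print(model(**batch).logits.argmax(dim=-1))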

Comparison between ALBERT and BERT

Note that while ALBERT-base is comparable to BERT-base on the results shown below, it is over 10 times smaller (about 135 MB vs. 1.2 GB). The class labels in both reports are the Eduge news categories: environment, education, sports, technology, politics, arts & culture, law, economy, and health.

  • ALBERT-Mongolian:
                          precision    recall  f1-score   support

            байгал орчин       0.85      0.83      0.84       999
               боловсрол       0.80      0.80      0.80       873
                   спорт       0.98      0.98      0.98      2736
               технологи       0.88      0.93      0.91      1102
                 улс төр       0.92      0.85      0.89      2647
              урлаг соёл       0.93      0.94      0.94      1457
                   хууль       0.89      0.87      0.88      1651
             эдийн засаг       0.83      0.88      0.86      2509
              эрүүл мэнд       0.89      0.92      0.90      1159

                accuracy                           0.90     15133
               macro avg       0.89      0.89      0.89     15133
            weighted avg       0.90      0.90      0.90     15133

  • BERT-Mongolian:

                          precision    recall  f1-score   support

            байгал орчин       0.82      0.84      0.83       999
               боловсрол       0.91      0.70      0.79       873
                   спорт       0.97      0.98      0.97      2736
               технологи       0.91      0.85      0.88      1102
                 улс төр       0.87      0.86      0.86      2647
              урлаг соёл       0.88      0.96      0.92      1457
                   хууль       0.86      0.85      0.86      1651
             эдийн засаг       0.84      0.87      0.85      2509
              эрүүл мэнд       0.90      0.90      0.90      1159

                accuracy                           0.88     15133
               macro avg       0.88      0.87      0.87     15133
            weighted avg       0.88      0.88      0.88     15133

Reproduce

Pretrain from scratch: you can follow PRETRAIN_SCRATCH.md to reproduce the results.
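
Since the released tokenizer is a SentencePiece model, here is a minimal sketch of how such a model could be trained on a raw corpus; the file names, vocabulary size, and model type are assumptions, and the exact settings used are described in PRETRAIN_SCRATCH.md:

import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text corpus (one sentence per line).
# "mn_corpus.txt", the model prefix, and vocab_size=30000 are hypothetical values.
spm.SentencePieceTrainer.train(
    input="mn_corpus.txt",
    model_prefix="albert_mn",
    vocab_size=30000,
    model_type="unigram",
)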

Here is the pretraining loss curve (figure: Pretraining Loss).

Reference

  1. ALBERT - official repo
  2. WikiExtractor
  3. Mongolian BERT
  4. ALBERT - Japanese
  5. Mongolian Text Classification
  6. You et al.'s paper
  7. AWS-Mongolia e-meetup #3

Citation

@misc{albert-mongolian,
  author = {Bayartsogt Yadamsuren},
  title = {ALBERT Pretrained Model on Mongolian Datasets},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}