
ALBERT-Mongolian

HuggingFace demo

This repo provides a pretrained ALBERT model ("A Lite" version of BERT) and a SentencePiece model (an unsupervised text tokenizer and detokenizer), both trained on a Mongolian text corpus.

Usage

You can use ALBERT-Mongolian in both PyTorch and TensorFlow 2.0 via the transformers library.

link to HuggingFace model card 🤗

import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

# Load the SentencePiece-based tokenizer and the masked-LM checkpoint from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
model = AlbertForMaskedLM.from_pretrained('bayartsogt/albert-mongolian')
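
As a quick sanity check, here is a minimal fill-mask sketch; the example sentence and the top-5 choice are illustrative and not taken from this repo:

# Predict the most likely tokens for a masked position in a Mongolian sentence
text = f"Улаанбаатар бол Монгол улсын {tokenizer.mask_token} юм."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the 5 highest-scoring subword tokens
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))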

Tutorials

AWS-Mongolians e-meetup #3

Results

| Model       | Problem             | Task          | weighted F1 |
|-------------|---------------------|---------------|-------------|
| ALBERT-base | Text Classification | Eduge dataset | 0.90        |
| ...         | ...                 | ...           | ...         |
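
For context, here is a minimal sketch of how the checkpoint could be loaded for such a text-classification task; the label count, example sentence, and the lack of a training loop are assumptions, not the repo's exact fine-tuning setup:

import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
# num_labels=9 assumes the nine Eduge news categories; the classification head
# is randomly initialized and must be fine-tuned before predictions are meaningful
model = AlbertForSequenceClassification.from_pretrained(
    'bayartsogt/albert-mongolian', num_labels=9)

texts = ["Монголын баг тэмцээнд түрүүллээ."]  # illustrative input, not from the dataset
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    print(model(**batch).logits.argmax(dim=-1))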

Comparison between ALBERT and BERT

Note that while ALBERT-base is comparable to BERT-base on the results shown below, it is over 10 times smaller (about 135 MB vs. 1.2 GB). The class labels in both reports are the Eduge news categories: environment, education, sports, technology, politics, arts & culture, law, economy, and health.

  • ALBERT-Mongolian:
                          precision    recall  f1-score   support

            байгал орчин       0.85      0.83      0.84       999
               боловсрол       0.80      0.80      0.80       873
                   спорт       0.98      0.98      0.98      2736
               технологи       0.88      0.93      0.91      1102
                 улс төр       0.92      0.85      0.89      2647
              урлаг соёл       0.93      0.94      0.94      1457
                   хууль       0.89      0.87      0.88      1651
             эдийн засаг       0.83      0.88      0.86      2509
              эрүүл мэнд       0.89      0.92      0.90      1159

                accuracy                           0.90     15133
               macro avg       0.89      0.89      0.89     15133
            weighted avg       0.90      0.90      0.90     15133

  • BERT-Mongolian:

                          precision    recall  f1-score   support

            байгал орчин       0.82      0.84      0.83       999
               боловсрол       0.91      0.70      0.79       873
                   спорт       0.97      0.98      0.97      2736
               технологи       0.91      0.85      0.88      1102
                 улс төр       0.87      0.86      0.86      2647
              урлаг соёл       0.88      0.96      0.92      1457
                   хууль       0.86      0.85      0.86      1651
             эдийн засаг       0.84      0.87      0.85      2509
              эрүүл мэнд       0.90      0.90      0.90      1159

                accuracy                           0.88     15133
               macro avg       0.88      0.87      0.87     15133
            weighted avg       0.88      0.88      0.88     15133

Reproduce

Pretrain from scratch: you can follow PRETRAIN_SCRATCH.md to reproduce the results.
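
Since the released tokenizer is a SentencePiece model, here is a minimal sketch of how such a model could be trained on a raw corpus; the file names, vocabulary size, and model type are assumptions, and the exact settings used are described in PRETRAIN_SCRATCH.md:

import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text corpus (one sentence per line).
# "mn_corpus.txt", the model prefix, and vocab_size=30000 are hypothetical values.
spm.SentencePieceTrainer.train(
    input="mn_corpus.txt",
    model_prefix="albert_mn",
    vocab_size=30000,
    model_type="unigram",
)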

Here is the pretraining loss curve (figure: Pretraining Loss).

Reference

  1. ALBERT - official repo
  2. WikiExtractor
  3. Mongolian BERT
  4. ALBERT - Japanese
  5. Mongolian Text Classification
  6. You et al.'s paper
  7. AWS-Mongolia e-meetup #3

Citation

@misc{albert-mongolian,
  author = {Bayartsogt Yadamsuren},
  title = {ALBERT Pretrained Model on Mongolian Datasets},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}