Han Transformers

This project provides models for ancient Chinese NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.

Our paper has been accepted to ROCLING 2022! Please check it out: https://aclanthology.org/2022.rocling-1.21

Dependency

  • transformers ≤ 4.15.0
  • pytorch

Models

We have uploaded our models to the Hugging Face Hub: ckiplab/bert-base-han-chinese (language model), ckiplab/bert-base-han-chinese-ws (word segmentation), and ckiplab/bert-base-han-chinese-pos (part-of-speech tagging).

Training Corpus

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

Usage

Installation

pip install transformers==4.15.0
pip install torch==1.10.2

Inference

  • Pre-trained Language Model

    You can use ckiplab/bert-base-han-chinese directly with a pipeline for masked language modeling.

    from transformers import pipeline
    
    # Initialize 
    unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')
    
    # Input text with [MASK]
    unmasker("黎[MASK]於變時雍。")
    
    # output
    [{'sequence': '黎 民 於 變 時 雍 。',
    'score': 0.14885780215263367,
    'token': 3696,
    'token_str': '民'},
    {'sequence': '黎 庶 於 變 時 雍 。',
    'score': 0.0859643816947937,
    'token': 2433,
    'token_str': '庶'},
    {'sequence': '黎 氏 於 變 時 雍 。',
    'score': 0.027848130092024803,
    'token': 3694,
    'token_str': '氏'},
    {'sequence': '黎 人 於 變 時 雍 。',
    'score': 0.023678112775087357,
    'token': 782,
    'token_str': '人'},
    {'sequence': '黎 生 於 變 時 雍 。',
    'score': 0.018718384206295013,
    'token': 4495,
    'token_str': '生'}]

    You can use ckiplab/bert-base-han-chinese to get the features of a given text in PyTorch.

    from transformers import AutoTokenizer, AutoModel
    
    # Initialize tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
    model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")
    
    # Input text
    text = "黎民於變時雍。"
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    
    # get encoded token vectors
    output.last_hidden_state    # torch.Tensor with Size([1, 9, 768])
    
    # get encoded sentence vector
    output.pooler_output        # torch.Tensor with Size([1, 768])
  • Word Segmentation (WS)

    In WS, ckiplab/bert-base-han-chinese-ws divides the written text into meaningful units, i.e. words. The task is formulated as labeling each character as either the beginning of a word (B) or inside a word (I). A sketch of how to merge these character-level tags back into words is given after the examples below.

    from transformers import pipeline
    
    # Initialize
    classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
    
    # Input text
    classifier("帝堯曰放勳")
    
    # output
    [{'entity': 'B',
    'score': 0.9999793,
    'index': 1,
    'word': '帝',
    'start': 0,
    'end': 1},
    {'entity': 'I',
    'score': 0.9915047,
    'index': 2,
    'word': '堯',
    'start': 1,
    'end': 2},
    {'entity': 'B',
    'score': 0.99992275,
    'index': 3,
    'word': '曰',
    'start': 2,
    'end': 3},
    {'entity': 'B',
    'score': 0.99905187,
    'index': 4,
    'word': '放',
    'start': 3,
    'end': 4},
    {'entity': 'I',
    'score': 0.96299917,
    'index': 5,
    'word': '勳',
    'start': 4,
    'end': 5}]
  • Part-of-Speech (PoS) Tagging

    In PoS tagging, ckiplab/bert-base-han-chinese-pos recognizes parts of speech in a given text. The task is formulated as labeling each character with a part-of-speech tag. A sketch of combining the WS and PoS outputs into word-level tags follows the examples below.

    from transformers import pipeline
    
    # Initialize
    classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")
    
    # Input text
    classifier("帝堯曰放勳")
    
    # output
    [{'entity': 'NB1',
    'score': 0.99410427,
    'index': 1,
    'word': '帝',
    'start': 0,
    'end': 1},
    {'entity': 'NB1',
    'score': 0.98874336,
    'index': 2,
    'word': '堯',
    'start': 1,
    'end': 2},
    {'entity': 'VG',
    'score': 0.97059363,
    'index': 3,
    'word': '曰',
    'start': 2,
    'end': 3},
    {'entity': 'NB1',
    'score': 0.9864504,
    'index': 4,
    'word': '放',
    'start': 3,
    'end': 4},
    {'entity': 'NB1',
    'score': 0.9543974,
    'index': 5,
    'word': '勳',
    'start': 4,
    'end': 5}]
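
Post-processing note: the WS pipeline above returns one B/I tag per character, so a small extra step is needed to recover whole words. Below is a minimal sketch; the helper merge_ws_tokens and its grouping logic are illustrative additions, not part of the CKIP release.

    from transformers import pipeline

    # Character-level word-segmentation tags (B = begins a word, I = inside a word)
    ws_classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

    def merge_ws_tokens(tags):
        """Group character-level B/I predictions into words (illustrative helper)."""
        words = []
        for tag in tags:
            if tag["entity"] == "B" or not words:
                words.append(tag["word"])    # "B" starts a new word
            else:
                words[-1] += tag["word"]     # "I" extends the current word
        return words

    print(merge_ws_tokens(ws_classifier("帝堯曰放勳")))
    # With the tags shown above, this yields: ['帝堯', '曰', '放', '勳']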
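
Similarly, the PoS pipeline also tags individual characters. One way to obtain word-level tags is to run the WS and PoS models together and keep the tag of each word's first character. The pairing heuristic below is our own assumption rather than something prescribed by the repository, and it relies on both pipelines returning one entry per character.

    from transformers import pipeline

    ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
    pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

    def word_level_pos(text):
        """Pair segmented words with the PoS tag of their first character (illustrative heuristic)."""
        ws_tags = ws(text)
        pos_tags = pos(text)
        results = []
        for i, tag in enumerate(ws_tags):
            if tag["entity"] == "B" or not results:
                results.append([tag["word"], pos_tags[i]["entity"]])   # start a new word
            else:
                results[-1][0] += tag["word"]                          # extend the current word
        return [tuple(r) for r in results]

    print(word_level_pos("帝堯曰放勳"))
    # With the example outputs above: [('帝堯', 'NB1'), ('曰', 'VG'), ('放', 'NB1'), ('勳', 'NB1')]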

Model Performance

Pre-trained Language Model, Perplexity ↓

Language Model                   MLM Training Data   MLM Testing Data
                                                     上古        中古        近代        現代
ckiplab/bert-base-han-chinese    上古                24.7588     87.8176     297.1111    60.3993
                                 中古                67.861      70.6244     133.0536    23.0125
                                 近代                69.1364     77.4154     46.8308     20.4289
                                 現代                118.8596    163.6896    146.5959    4.6143
                                 Merge               31.1807     61.2381     49.0672     4.5017
ckiplab/bert-base-chinese        -                   233.6394    405.9008    278.7069    8.8521
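
For reference, one common way to estimate such a perplexity for a masked language model is the pseudo-perplexity: mask each token in turn, score the true token, and exponentiate the average negative log-likelihood. The sketch below follows that standard recipe; it is not necessarily the exact evaluation protocol used in the paper.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
    model = AutoModelForMaskedLM.from_pretrained("ckiplab/bert-base-han-chinese")
    model.eval()

    def pseudo_perplexity(text):
        """Mask each position in turn and average the negative log-likelihood of the true token."""
        ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        nlls = []
        for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[ids[i]].item())
        return float(torch.exp(torch.tensor(nlls).mean()))

    print(pseudo_perplexity("黎民於變時雍。"))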

Word Segmentation (WS), F1 score (%) ↑

WS Model                           Training Data   Testing Data
                                                   上古       中古       近代       現代
ckiplab/bert-base-han-chinese-ws   上古            97.6090    88.5734    83.2877    70.3772
                                   中古            92.6402    92.6538    89.4803    78.3827
                                   近代            90.8651    92.1861    94.6495    81.2143
                                   現代            87.0234    83.5810    84.9370    96.9446
                                   Merge           97.4537    91.9990    94.0970    96.7314
ckiplab/bert-base-chinese-ws       -               86.5698    82.9115    84.3213    98.1325

Part-of-Speech (POS) Tagging, F1 score (%) ↑

POS Model                           Training Data   Testing Data
                                                    上古       中古       近代       現代
ckiplab/bert-base-han-chinese-pos   上古            91.2945    -          -          -
                                    中古            7.3662     80.4896    11.3371    10.2577
                                    近代            6.4794     14.3653    88.6580    0.5316
                                    現代            11.9895    11.0775    0.4033     93.2813
                                    Merge           88.8772    42.4369    86.9093    92.9012

License

Copyright (c) 2022 CKIP Lab under the GPL-3.0 License.

Citation

Please cite our paper if you use Han-Transformers in your work:

@inproceedings{lin-ma-2022-hantrans,
    title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author = "Lin, Chin-Tung  and  Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year = "2022",
    address = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url = "https://aclanthology.org/2022.rocling-1.21",
    pages = "164--173",
}
