Han Transformers

This project provides ancient Chinese models to NLP tasks including language modeling, word segmentation and part-of-speech tagging.

Our paper has been accepted to ROCLING! Please check out our paper.

Dependency

transformers ≤ 4.15.0
pytorch

Models

We uploaded our models to HuggingFace hub.

Pretrained models using a masked language modeling (MLM) objective.
- ckiplab/bert-base-han-chinese
Fine-tuned models for Word Segmentation.
- ckiplab/bert-base-han-chinese-ws (Merge)
- ckiplab/bert-base-han-chinese-ws-shanggu (上古)
- ckiplab/bert-base-han-chinese-ws-zhonggu (中古)
- ckiplab/bert-base-han-chinese-ws-jindai (近代)
- ckiplab/bert-base-han-chinese-ws-xiandai (現代)
Fine-tuned models for Part-of-Speech tagging.
- ckiplab/bert-base-han-chinese-pos (Merge)
- ckiplab/bert-base-han-chinese-pos-shanggu (上古 / 標記列表)
- ckiplab/bert-base-han-chinese-pos-zhonggu (中古 / 標記列表)
- ckiplab/bert-base-han-chinese-pos-jindai (近代 / 標記列表)
- ckiplab/bert-base-han-chinese-pos-xiandai (現代 / 標記列表)

Training Corpus

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

Usage

Installation

pip install transformers==4.15.0
pip install torch==1.10.2

Inference

Pre-trained Language Model

You can use ckiplab/bert-base-han-chinese directly with a pipeline for masked language modeling.

from transformers import pipeline

# Initialize 
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')

# Input text with [MASK]
unmasker("黎[MASK]於變時雍。")

# output
[{'sequence': '黎 民 於 變 時 雍 。',
'score': 0.14885780215263367,
'token': 3696,
'token_str': '民'},
{'sequence': '黎 庶 於 變 時 雍 。',
'score': 0.0859643816947937,
'token': 2433,
'token_str': '庶'},
{'sequence': '黎 氏 於 變 時 雍 。',
'score': 0.027848130092024803,
'token': 3694,
'token_str': '氏'},
{'sequence': '黎 人 於 變 時 雍 。',
'score': 0.023678112775087357,
'token': 782,
'token_str': '人'},
{'sequence': '黎 生 於 變 時 雍 。',
'score': 0.018718384206295013,
'token': 4495,
'token_str': '生'}]

You can use ckiplab/bert-base-han-chinese to get the features of a given text in PyTorch.

from transformers import AutoTokenizer, AutoModel

# Initialize tokenzier and model
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

# Input text
text = "黎民於變時雍。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# get encoded token vectors
output.last_hidden_state    # torch.Tensor with Size([1, 9, 768])

# get encoded sentence vector
output.pooler_output        # torch.Tensor with Size([1, 768])

Word Segmentation (WS)

In WS, ckiplab/bert-base-han-chinese-ws divides written the text into meaningful units - words. The task is formulated as labeling each word with either beginning (B) or inside (I).

from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'B',
'score': 0.9999793,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'I',
'score': 0.9915047,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'B',
'score': 0.99992275,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'B',
'score': 0.99905187,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'I',
'score': 0.96299917,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]

Part-of-Speech (PoS) Tagging

In PoS tagging, ckiplab/bert-base-han-chinese-pos recognizes parts of speech in a given text. The task is formulated as labeling each word with a part of the speech.

from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'NB1',
'score': 0.99410427,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'NB1',
'score': 0.98874336,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'VG',
'score': 0.97059363,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'NB1',
'score': 0.9864504,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'NB1',
'score': 0.9543974,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]

Model Performance

Pre-trained Language Model, Perplexity ↓

Language Model	MLM Training Data	MLM Testing Data
Language Model	MLM Training Data	上古	中古	近代	現代
ckiplab/bert-base-han-Chinese	上古	24.7588	87.8176	297.1111	60.3993
	中古	67.861	70.6244	133.0536	23.0125
	近代	69.1364	77.4154	46.8308	20.4289
	現代	118.8596	163.6896	146.5959	4.6143
	Merge	31.1807	61.2381	49.0672	4.5017
ckiplab/bert-base-chinese	-	233.6394	405.9008	278.7069	8.8521

Word Segmentation (WS), F1 score (%) ↑

WS Model	Training Data	Testing Data
WS Model	Training Data	上古	中古	近代	現代
ckiplab/bert-base-han-chinese-ws	上古	97.6090	88.5734	83.2877	70.3772
	中古	92.6402	92.6538	89.4803	78.3827
	近代	90.8651	92.1861	94.6495	81.2143
	現代	87.0234	83.5810	84.9370	96.9446
	Merge	97.4537	91.9990	94.0970	96.7314
ckiplab/bert-base-chinese-ws	-	86.5698	82.9115	84.3213	98.1325

Part-of-Speech (POS) Tagging, F1 score (%) ↑

POS Model	Training Data	Testing Data
POS Model	Training Data	上古	中古	近代	現代
ckiplab/bert-base-han-chinese-pos	上古	91.2945	-	-	-
	中古	7.3662	80.4896	11.3371	10.2577
	近代	6.4794	14.3653	88.6580	0.5316
	現代	11.9895	11.0775	0.4033	93.2813
	Merge	88.8772	42.4369	86.9093	92.9012

License

Citation

Please cite our paper if you use Han-Transformers in your work:

@inproceedings{lin-ma-2022-hantrans,
    title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author = "Lin, Chin-Tung  and  Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year = "2022",
    address = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url = "https://aclanthology.org/2022.rocling-1.21",
    pages = "164--173",
}

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.gitignore		.gitignore
README.md		README.md
jindai.md		jindai.md
shanggu.md		shanggu.md
xiandai.md		xiandai.md
zhonggu.md		zhonggu.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

jindai.md

jindai.md

shanggu.md

shanggu.md

xiandai.md

xiandai.md

zhonggu.md

zhonggu.md

Repository files navigation

Han Transformers

Dependency

Models

Training Corpus

Usage

Installation

Inference

Model Performance

Pre-trained Language Model, Perplexity ↓

Word Segmentation (WS), F1 score (%) ↑

Part-of-Speech (POS) Tagging, F1 score (%) ↑

License

Citation

About

Releases

Packages

ckiplab/han-transformers

Folders and files

Latest commit

History

Repository files navigation

Han Transformers

Dependency

Models

Training Corpus

Usage

Installation

Inference

Model Performance

Pre-trained Language Model, Perplexity ↓

Word Segmentation (WS), F1 score (%) ↑

Part-of-Speech (POS) Tagging, F1 score (%) ↑

License

Citation

About

Resources

Stars

Watchers

Forks