Ancient Chinese NER with LLM-Generated Dataset

Final Project for ADL 2023 Fall
R11922101 Chia-Hung Huang

Project Structure

data/
- variant.csv: Chinese variant character list.
- train_data.json: Training data.
- val_data.json: Validation data.
model/: Saved checkpoint.
utils/
- clean_variant.py: Replace Chinese variant characters with standard characters.
- opanai_ner.py: Call the OpenAI GPT-4-turbo API and generate the NER data in Python dictionary format.
- labeling.py: Label all the tokens based on the dictionary data generated in the previous step.
generate_dataset.py: Use functions in utils/ to generate the training and validation dataset.
run_ner.py: The training script.
run_ner_test.py: The testing (predicting) script.
plot.py: Plot the validation-set loss and F1 score over the course of training.
app.py: Streamlit web UI for the NER tool.
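
The two preprocessing ideas behind utils/ (variant replacement and token labeling) can be pictured roughly as follows. This is a minimal sketch, not the repository's actual code: the function names `normalize_variants` and `bio_label`, the variant mapping, the shape of the entity dictionary, the BIO tag scheme, and the sample sentence are all assumptions made for illustration.

```python
# Illustrative sketch only; names, data shapes, and the BIO scheme are assumptions.

def normalize_variants(text, variant_map):
    """Replace each variant character with its standard form."""
    return "".join(variant_map.get(ch, ch) for ch in text)


def bio_label(text, entities):
    """Assign one BIO tag per character.

    `entities` maps an entity type (e.g. "PER") to a list of surface strings,
    a hypothetical shape for the dictionary produced by the LLM step.
    """
    labels = ["O"] * len(text)
    for ent_type, mentions in entities.items():
        for mention in mentions:
            if not mention:
                continue
            start = text.find(mention)
            while start != -1:
                end = start + len(mention)
                # Skip spans that overlap an already-labeled entity.
                if all(tag == "O" for tag in labels[start:end]):
                    labels[start] = f"B-{ent_type}"
                    for i in range(start + 1, end):
                        labels[i] = f"I-{ent_type}"
                start = text.find(mention, end)
    return labels


# Hypothetical variant mapping: variant character -> standard character.
variant_map = {"羣": "群"}
sentence = normalize_variants("孔子謂羣臣", variant_map)
tags = bio_label(sentence, {"PER": ["孔子"]})
```

Labeling per character rather than per word avoids needing a word segmenter for classical Chinese, which is one plausible reason to structure the labeling step this way.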

How to Run the Code

Install Dependencies
```
pip install -r requirements.txt
```
Download the Model
```
bash download.sh
```
Train
```
bash run.sh
```
Run the App
```
streamlit run app.py
```
To generate datasets with generate_dataset.py, first create a .env file, using .env-default as a template.
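
For reference, a .env file is conventionally a list of KEY=VALUE lines. The sketch below shows one way such a file could be loaded using only the standard library. The key name `OPENAI_API_KEY` is an assumption (check .env-default for the actual keys), and the repository may use a library such as python-dotenv instead.

```python
import os
import tempfile


def load_env(path):
    """Parse KEY=VALUE lines from `path` into os.environ; return the parsed dict."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Ignore blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(values)
    return values


# Demo with a throwaway file; OPENAI_API_KEY is a hypothetical key name.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as f:
        f.write("# copied from .env-default\nOPENAI_API_KEY=sk-placeholder\n")
    env = load_env(env_path)
```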

About

An NER tool for ancient Chinese in traditional Chinese characters.
