Can Cross-domain Term Extraction Benefit from Cross-lingual Transfer and Nested Term Labeling?

1. Description

In this repo, we extend our work in Can Cross-domain Term Extraction Benefit from Cross-lingual Transfer? by introducing a novel nested term labeling mechanism (NOBI) and evaluating the model's performance in cross-lingual and multi-lingual settings against the traditional BIO annotation regime.


2. Requirements

Please install all the necessary libraries listed in requirements.txt using this command:

pip install -r requirements.txt

3. Data

The experiments were conducted on 2 datasets:

                   ACTER dataset                                        RSDO5 dataset
Languages          English, French, and Dutch                          Slovenian
Domains            Corruption, Wind energy, Equitation, Heart failure  Biomechanics, Chemistry, Veterinary, Linguistics
Original version   AylaRT/ACTER                                        Corpus of term-annotated texts RSDO5 1.0

4. Implementation

4.1. Preprocessing

The new nested term labeling mechanism (NOBI) and the labeled data are available at @honghanhh/nobi_annotation_regime.
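
For intuition, the small sketch below contrasts plain BIO labels with a nested-aware labeling on a term pair where one term is nested inside another. The composite tags ("I+B", "I+I") and the example terms are purely illustrative, not the official NOBI labels, which are documented in that repository:

# Illustrative only: "I+B" / "I+I" are NOT official NOBI labels; they merely show
# the extra information a nested-aware scheme has to encode per token.
tokens = ["chronic", "heart", "failure"]

# Suppose both "chronic heart failure" and the nested "heart failure" are gold terms.
gold_terms = [("chronic heart failure", (0, 3)), ("heart failure", (1, 3))]

bio_labels = ["B", "I", "I"]          # plain BIO keeps only one span per token
nested_labels = ["B", "I+B", "I+I"]   # hypothetical tags that also mark the inner term

for token, bio, nested in zip(tokens, bio_labels, nested_labels):
    print(f"{token:10s} BIO: {bio:4s} nested: {nested}")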

4.2. Workflow

The workflow of the model is described in our upcoming 2023 paper. To reproduce the results, please run the following commands:

chmod +x run.sh
./run.sh

which runs the model for all of the following scenarios:

  • ACTER dataset with XLM-RoBERTa in mono-lingual, cross-lingual, and multi-lingual settings, covering both the ANN and NES versions; the multi-lingual settings cover the three ACTER languages plus additional Slovenian add-ons (10 scenarios).

  • RSDO5 dataset with XLM-RoBERTa in mono-lingual, cross-lingual, and multi-lingual settings, with the cross-lingual and multi-lingual settings taking the ANN and NES versions into account (48 scenarios).

Note that the model produces results for the NOBI-annotated set. To reproduce the results for the BIO-annotated set, please refer to @honghanhh/ate-2022.

4.3. Model configuration

Feel free to tune the model's hyper-parameters. The current settings are:

num_train_epochs=20,             # total number of training epochs
per_device_train_batch_size=32,  # batch size per device during training
per_device_eval_batch_size=32,   # batch size per device during evaluation
learning_rate=2e-5,              # initial learning rate
eval_steps=500,                  # evaluate every 500 training steps
load_best_model_at_end=True,     # load the best model at the end of training
metric_for_best_model="f1",      # select the best checkpoint by F1 score
greater_is_better=True           # a higher F1 score is better
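
For reference, the following minimal sketch (not the repository's exact training script) shows how these settings plug into a Hugging Face TrainingArguments/Trainer setup for XLM-RoBERTa token classification. The tiny in-memory dataset, the three-tag label set, and the stand-in metric function are illustrative placeholders only:

import numpy as np
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder tag set; the NOBI regime uses a richer set of labels.
label_list = ["O", "B", "I"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(label_list)
)

def toy_split(sentences):
    # Tokenize and attach trivial all-"O" labels so the sketch runs end to end.
    enc = tokenizer(sentences, truncation=True, padding="max_length", max_length=16)
    enc["labels"] = [[0] * 16 for _ in sentences]
    return Dataset.from_dict(dict(enc))

train_dataset = toy_split(["Wind energy is a renewable source ."])
eval_dataset = toy_split(["Heart failure is a chronic condition ."])

def compute_metrics(eval_pred):
    # Stand-in metric: token accuracy reported under the "f1" key so that
    # metric_for_best_model="f1" has a value to read; a proper sequence-labeling
    # F1 (e.g. via seqeval) would normally be computed here.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    return {"f1": float(((preds == labels) & mask).sum() / max(mask.sum(), 1))}

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=20,             # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size per device during evaluation
    learning_rate=2e-5,
    evaluation_strategy="steps",     # needed so eval_steps takes effect
    eval_steps=500,
    save_steps=500,                  # save cadence must align with eval steps
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()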

5. Results

Please refer to our upcoming 2023 paper for the results and error analysis.

References

Tran, Hanh Thi Hong, et al. "Can Cross-Domain Term Extraction Benefit from Cross-lingual Transfer?" Discovery Science: 25th International Conference, DS 2022, Montpellier, France, October 10–12, 2022, Proceedings. Cham: Springer Nature Switzerland, 2022.

Contributors: