Input data format question for custom dataset ! #66

solved2 · 2024-03-23T18:12:11Z

Hello, I am trying to train a PURE model on a Korean entity-relation extraction dataset with pre-trained KoBERT (a Korean BERT model available on Hugging Face). In my dataset, the start and end positions of entities are given at the character level. (For example, for the English sentence “I am a student”, the start index of the “student” entity is given as 8.)

Question 1) Can I use the dataset as input to the model with character-level indexing like this? If not, can I use it as training data if I tokenize the Korean sentences on spaces (' ') and recalculate the indices accordingly?
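For reference, here is a minimal sketch of the recalculation I have in mind for Question 1. It assumes 0-based, end-exclusive character offsets (the index 8 in my example would be 7 under this convention) and inclusive token indices in the output; the function names are my own and not part of PURE. It returns `None` when an entity does not line up with whitespace boundaries, which I expect to happen in Korean when a particle is attached to the entity.

```python
def whitespace_tokens_with_offsets(text):
    """Split on whitespace and record each token's [start, end) character span."""
    tokens, offsets = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # search from the end of the previous token
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end
    return tokens, offsets


def char_span_to_token_span(offsets, char_start, char_end):
    """Map a [char_start, char_end) span to inclusive token indices.

    Returns None if the span does not align with token boundaries.
    """
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if s == char_start:
            start_tok = i
        if e == char_end:
            end_tok = i
    if start_tok is None or end_tok is None or start_tok > end_tok:
        return None
    return start_tok, end_tok


tokens, offsets = whitespace_tokens_with_offsets("I am a student")
print(tokens)                                        # whitespace tokens
print(char_span_to_token_span(offsets, 7, 14))       # token span of "student"
```

Entities that come back as `None` would need a different split (for example, a morpheme-level tokenizer) or manual handling.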

Additionally, I assume that the PURE model splits the input tokens into smaller pieces using the tokenizer of the pre-trained model. As a result, the number of subword tokens in a sentence can be greater than the number of input tokens, so I think the start/end token positions of each entity must be recalculated.

Question 2) Does the PURE model handle this itself, i.e. does it adjust the start/end token positions of entities as the number of tokens increases?
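To illustrate the shift I mean in Question 2, here is a small sketch of how a word-level span would move after subword splitting. I do not know whether PURE already does something like this internally. `realign_span` and `toy_tokenize` are hypothetical names; `toy_tokenize` is only a stand-in for a real subword tokenizer (with Hugging Face, the tokenizer's `tokenize` method could be passed instead), and special tokens such as `[CLS]` would add a further offset that is not handled here.

```python
def realign_span(words, span, tokenize):
    """Map an inclusive word-level span to an inclusive subword-level span.

    `tokenize` is any callable that maps one word to a list of subword pieces.
    """
    word_start = []  # index of the first subword piece of each word
    word_end = []    # index of the last subword piece of each word
    pieces = []
    for w in words:
        sub = tokenize(w) or [w]  # fall back to the word itself if nothing is returned
        word_start.append(len(pieces))
        pieces.extend(sub)
        word_end.append(len(pieces) - 1)
    s, e = span
    return pieces, (word_start[s], word_end[e])


def toy_tokenize(word):
    """Hypothetical tokenizer: 3-character chunks, '##' marks continuations."""
    if not word:
        return []
    chunks = [word[i:i + 3] for i in range(0, len(word), 3)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


words = ["I", "am", "a", "student"]
pieces, new_span = realign_span(words, (3, 3), toy_tokenize)
print(pieces)    # subword pieces for the whole sentence
print(new_span)  # where the "student" span ends up at the subword level
```

In this toy case the single-word entity at word index 3 expands to cover several subword positions, which is the kind of change I am asking about.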

Question 3) Are there any additional considerations or modifications I need to make when using a custom dataset and a custom pre-trained model?

Thank you for your great work. It would be really helpful if you could reply.

solved2 changed the title ~~Input data format question.~~ Input data format question for custom dataset ! Mar 23, 2024
