How to prepare dataset for training the model? #63

karndeepsingh · 2022-02-02T09:22:14Z

Hi, thanks for sharing this awesome work. I have a few questions; please help me understand:

I have a set of text paragraphs and want to extract entities and the relations between the detected entities. How should I prepare my dataset for the NER and relation extraction model from these paragraphs? What format should I follow?
If you could recommend any tool or any way to prepare or annotate the data in the format that the model expects, it would be a great help.
Thanks.

markus-eberts · 2022-02-07T19:40:53Z

Hi,
please execute bash ./scripts/fetch_datasets.sh to download the preprocessed datasets. The datasets are then placed under data/datasets. You should follow the format used in the preprocessed datasets. Each sample (e.g. in data/datasets/conll04/conll04_dev.json) contains the sentence tokens, entities (with start/end referring to a token index; end is exclusive) and relations (with head/tail referring to an index in the entity list). You should also split your paragraphs into sentences and add one sample for each sentence. There is also a *_types.json file (e.g. data/datasets/conll04/conll04_types.json), containing the entity/relation types.
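To make the index conventions concrete, here is a minimal sketch of one sentence-level sample plus a small sanity check. The key names (tokens, entities, relations, type, start, end, head, tail) follow the description above and should be verified against the downloaded files; the sentence, the type labels and the helper names validate_sample and entity_text are made up for illustration.

```python
import json


def validate_sample(sample):
    """Check the index conventions for one sentence-level sample."""
    n_tokens = len(sample["tokens"])
    for ent in sample["entities"]:
        # start/end are token indices; end is exclusive, so it may equal n_tokens
        assert 0 <= ent["start"] < ent["end"] <= n_tokens, ent
    n_entities = len(sample["entities"])
    for rel in sample["relations"]:
        # head/tail are positions in this sample's entity list, not token indices
        assert 0 <= rel["head"] < n_entities, rel
        assert 0 <= rel["tail"] < n_entities, rel
    return True


def entity_text(sample, index):
    """Return the surface text of the entity at the given position in the entity list."""
    ent = sample["entities"][index]
    return " ".join(sample["tokens"][ent["start"]:ent["end"]])


# Hypothetical sample: sentence and type labels are illustrative only
sample = {
    "tokens": ["John", "Smith", "works", "for", "Acme", "Corp", "."],
    "entities": [
        {"type": "Peop", "start": 0, "end": 2},
        {"type": "Org", "start": 4, "end": 6},
    ],
    "relations": [{"type": "Work_For", "head": 0, "tail": 1}],
}

validate_sample(sample)
# A dataset file is assumed here to be a JSON list of such samples
print(json.dumps([sample], indent=2))
```

Running a check like this over every sample before training can catch off-by-one errors in the exclusive end index early.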

There are multiple annotation tools for entities/relations, for example brat. However, you probably need to convert the data annotated with a tool such as brat to the format used in SpERT.
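A rough sketch of such a conversion, assuming plain brat standoff input (T lines with character offsets, R lines with Arg1/Arg2) and the sample keys described above. This is not the parser mentioned later in this thread: the function name brat_to_sample and the regex tokenizer are assumptions, discontinuous spans, events and attributes are ignored, and the whole text is treated as one sample, so sentence splitting still has to be done separately.

```python
import re

# Simple tokenizer (an assumption): words, or single non-space punctuation characters
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def brat_to_sample(text, ann):
    """Convert one brat document (text plus .ann content) into a single sample dict."""
    # Keep character offsets so brat character spans can be mapped to token indices
    matches = list(TOKEN_RE.finditer(text))
    tokens = [m.group() for m in matches]

    entities, id_to_index, raw_relations = [], {}, []
    for line in ann.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        ann_id, body = fields[0], fields[1]
        if ann_id.startswith("T"):
            if ";" in body:
                continue  # discontinuous span: skipped in this sketch
            etype, char_start, char_end = body.split()
            char_start, char_end = int(char_start), int(char_end)
            # Indices of tokens that overlap the annotated character span
            covered = [i for i, m in enumerate(matches)
                       if m.start() < char_end and m.end() > char_start]
            if not covered:
                continue
            id_to_index[ann_id] = len(entities)
            # end is exclusive, hence the +1
            entities.append({"type": etype, "start": covered[0], "end": covered[-1] + 1})
        elif ann_id.startswith("R"):
            rtype, arg1, arg2 = body.split()
            raw_relations.append((rtype, arg1.split(":")[1], arg2.split(":")[1]))

    # Resolve relations last, since an R line may precede its T lines in the file
    relations = [
        {"type": rtype, "head": id_to_index[head], "tail": id_to_index[tail]}
        for rtype, head, tail in raw_relations
        if head in id_to_index and tail in id_to_index
    ]
    return {"tokens": tokens, "entities": entities, "relations": relations}


text = "John Smith works for Acme Corp."
ann = "T1\tPeop 0 10\tJohn Smith\nT2\tOrg 21 30\tAcme Corp\nR1\tWork_For Arg1:T1 Arg2:T2\n"
print(brat_to_sample(text, ann))
```

Mapping by overlap means an annotation that covers only part of a token is widened to the whole token, which is one place where information from brat can be lost.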

Kerman-Sanjuan · 2022-02-15T18:37:16Z

Hi! I've developed a parser that converts brat standoff to the SpERT format. It loses some data because of the complexity of brat standoff and the simplicity of the data required by SpERT, but it is better than nothing. Ask me if you want it; I'll try to document its use.

karndeepsingh · 2022-02-15T19:56:24Z

@Kerman-Sanjuan Thank you so much! It would be great if you could do so; it would benefit everyone here. Did you also train SpERT on your own dataset? How was the performance?

If you could also prepare a descriptive step-by-step guide to the implementation, that would be very helpful!

Thanks again!

karndeepsingh · 2022-02-15T20:03:24Z

@markus-eberts Thanks for answering. I would like to know more about the data side. I have very long paragraphs, some with more than 1024 tokens. Does SpERT have a restriction on the number of tokens, or does it use a sliding-window approach to handle long paragraphs for relation and entity extraction?

I used the spaCy relation extraction model, but it failed at long-distance relation extraction, when the two entities are some distance apart. Does SpERT have the same problem, or can it handle relations between entities that are far from each other?

Kerman-Sanjuan · 2022-02-16T12:20:32Z

@karndeepsingh I'm currently working with SpERT for my CS final thesis, testing different clinical datasets and trying to improve the model. So far I've parsed the whole BioNLP2011 task corpus; in theory I'm able to parse any corpus in brat standoff format. With BioNLP2011, SpERT performs quite well, better than expected to be honest. Over the following weeks I'll clean up the code a bit and document it.

karndeepsingh · 2022-02-21T04:43:43Z

@markus-eberts @Kerman-Sanjuan
Hi,
Could you please share the parser or code that converts brat output to the format SpERT accepts? I have annotated my dataset with the brat tool and want to use it with SpERT. It would be a great help if you could share the code that converts brat relation annotations to the required SpERT format.

Thanks

Kerman-Sanjuan · 2022-02-21T11:07:55Z

@karndeepsingh Yes, no problem. The usage is a bit tricky and it has some limitations at the moment, but I can explain or improve it.
How do you want me to send you the source code?

Yolalapi · 2022-09-16T15:08:56Z

@Kerman-Sanjuan Hi, could you please send it to my email address hanyuu@buaa.edu.cn ? Thanks a lot!

pierpaologoffredo · 2022-10-14T15:06:00Z

@Kerman-Sanjuan Hi! I have been following this thread because, like @Yolalapi and @karndeepsingh, I'm really interested in converting my brat annotations. If someone has already done this work, it could save a lot of time for me and for the whole community. Could you kindly share with me (or with the entire community) the code that converts brat annotations to JSON?

Thank you so much for your help in advance and your time.

Pierpaolo
