How to prepare dataset for training the model? #63

Open
karndeepsingh opened this issue Feb 2, 2022 · 9 comments

Comments

@karndeepsingh

Hi, thanks for sharing this awesome work. I have a few questions; please help me understand:

I have a set of text paragraphs and want to extract entities and the relations between the detected entities. How should I prepare my dataset for the NER and relation extraction model from these paragraphs? What format should I follow?
If you could recommend a tool, or any way to prepare or annotate the data in the format the model expects, it would be a great help.
Thanks.

@markus-eberts
Member

markus-eberts commented Feb 7, 2022

Hi,
please execute bash ./scripts/fetch_datasets.sh to download the preprocessed datasets. The datasets are then placed under data/datasets. You should follow the format used in the preprocessed datasets. Each sample (e.g. in data/datasets/conll04/conll04_dev.json) contains the sentence tokens, the entities (with start/end referring to token indices; end is exclusive) and the relations (with head/tail referring to indices in the entity list). You should also split your paragraphs into sentences and add one sample per sentence. There is also a *_types.json file (e.g. data/datasets/conll04/conll04_types.json) containing the entity/relation types.
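
The sample layout described above can be sketched roughly as follows. This is only an illustration, assuming the key names `tokens`, `entities`, `relations`, `type`, `start`, `end`, `head` and `tail`; the type labels and the helpers `validate_sample` and `entity_text` are hypothetical, so please check the downloaded files for the exact keys and labels.

```python
import json

# One sample per sentence. Key names are assumed from the description above;
# verify them against data/datasets/conll04/conll04_dev.json.
sample = {
    "tokens": ["John", "Smith", "works", "for", "Acme", "Corp", "."],
    "entities": [
        {"type": "Peop", "start": 0, "end": 2},  # tokens[0:2], end is exclusive
        {"type": "Org", "start": 4, "end": 6},   # tokens[4:6]
    ],
    "relations": [
        # head/tail are positions in the "entities" list above
        {"type": "Work_For", "head": 0, "tail": 1},
    ],
}


def validate_sample(sample):
    """Hypothetical helper: check that all indices are within range."""
    n_tokens = len(sample["tokens"])
    for ent in sample["entities"]:
        if not (0 <= ent["start"] < ent["end"] <= n_tokens):
            raise ValueError(f"entity span out of range: {ent}")
    n_entities = len(sample["entities"])
    for rel in sample["relations"]:
        for role in ("head", "tail"):
            if not (0 <= rel[role] < n_entities):
                raise ValueError(f"relation {role} out of range: {rel}")
    return True


def entity_text(sample, index):
    """Hypothetical helper: recover the surface text of an entity."""
    ent = sample["entities"][index]
    return " ".join(sample["tokens"][ent["start"]:ent["end"]])


validate_sample(sample)
dataset = [sample]  # a dataset file would hold a list of such samples
print(json.dumps(dataset, indent=2))
```

Running a check like this over your own data before training may help catch off-by-one errors caused by the exclusive end index.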

There are multiple annotation tools for entities/relations, for example brat. However, you probably need to convert the data annotated with a tool such as brat to the format used in SpERT.
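
As a rough idea of what such a conversion could look like, here is a minimal sketch. It is not the parser mentioned elsewhere in this thread and not part of SpERT. It assumes one sentence per text, whitespace tokenization, contiguous brat spans only, and the same output keys as the preprocessed datasets; `tokenize_with_offsets` and `brat_to_sample` are hypothetical names. Events, attributes and discontinuous spans in the brat standoff file are simply skipped.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenization that keeps (token, char_start, char_end)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def brat_to_sample(text, ann):
    """Hypothetical converter: brat standoff (.ann content) -> one sample dict."""
    tokens = tokenize_with_offsets(text)
    entities, relations = [], []
    entity_index = {}  # brat id such as "T1" -> position in `entities`
    relation_lines = []

    for line in ann.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        if parts[0].startswith("T"):
            if ";" in parts[1]:
                continue  # discontinuous span: not representable here, skipped
            etype, c_start, c_end = parts[1].split()
            c_start, c_end = int(c_start), int(c_end)
            # tokens whose character range overlaps the annotated span
            covered = [i for i, (_, s, e) in enumerate(tokens)
                       if s < c_end and e > c_start]
            if not covered:
                continue
            entity_index[parts[0]] = len(entities)
            entities.append({"type": etype,
                             "start": covered[0],
                             "end": covered[-1] + 1})  # exclusive end
        elif parts[0].startswith("R"):
            relation_lines.append(parts[1])

    # Relations are resolved last so they may appear before their entities.
    for info in relation_lines:
        rtype, arg1, arg2 = info.split()
        head_id = arg1.split(":", 1)[1]
        tail_id = arg2.split(":", 1)[1]
        if head_id in entity_index and tail_id in entity_index:
            relations.append({"type": rtype,
                              "head": entity_index[head_id],
                              "tail": entity_index[tail_id]})

    return {"tokens": [t for t, _, _ in tokens],
            "entities": entities,
            "relations": relations}


text = "John Smith works for Acme Corp ."
ann = ("T1\tPeop 0 10\tJohn Smith\n"
       "T2\tOrg 21 30\tAcme Corp\n"
       "R1\tWork_For Arg1:T1 Arg2:T2")
result = brat_to_sample(text, ann)
print(result)
```

With real text, punctuation attached to words (e.g. "Corp.") would end up inside the entity span under whitespace tokenization, so a proper tokenizer and sentence splitter would probably be needed in practice.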

@Kerman-Sanjuan

Hi! I've developed a parser that converts brat standoff to the SpERT format. It loses some data, because brat standoff is more complex than the simple format SpERT requires, but it is better than nothing. Ask me if you want it; I'll try to document how to use it.

@karndeepsingh
Author

@Kerman-Sanjuan Thank you so much! It would be great if you could do so; it would benefit everyone here. Did you also train SpERT on your own dataset? How was the performance?

If you could also prepare a descriptive step-by-step guide, that would be super good!

Thanks again!

@karndeepsingh
Author

@markus-eberts Thanks for answering. I would like to know more about the data side. I have very long paragraphs, some with more than 1024 tokens. Does SpERT have a limit on the number of tokens, or does it use a sliding window to handle long paragraphs for entity and relation extraction?

I tried the spaCy relation extraction model, but it failed on long-distance relations, where the two entities are a fair distance apart. Does SpERT have the same problem, or can it handle relations between entities that are far from each other?

@Kerman-Sanjuan

@karndeepsingh I'm currently working with SpERT for my CS final thesis, testing different clinical datasets and trying to improve the model. So far I've parsed the whole BioNLP2011 task corpus; in theory I can parse any corpus in brat standoff format. On BioNLP2011, SpERT performs quite well, better than expected to be honest. Over the following weeks I'll clean up the code a bit and document it.

@karndeepsingh
Author

@markus-eberts @Kerman-Sanjuan
Hi,
Could you please share the parser or code that converts brat output to the format SpERT accepts? I have annotated my dataset with the brat tool and want to use it with SpERT. It would help a lot if you could share code that converts brat relation annotations to the required SpERT format.

Thanks

@Kerman-Sanjuan

Kerman-Sanjuan commented Feb 21, 2022

@karndeepsingh Yes, no problem. The usage is a bit tricky and it has some limitations at the moment, but I can explain or improve it.
How do you want me to send you the source code?

@Yolalapi

@Kerman-Sanjuan Hi, could you please send it to my email address hanyuu@buaa.edu.cn ? Thanks a lot!

@pierpaologoffredo

@Kerman-Sanjuan Hi! I have been following this thread because, like @Yolalapi and @karndeepsingh, I'm really interested in converting my brat annotations. If someone has already done this work, it could save a lot of time for me and for the whole community. Could you kindly share with me (or with everyone) the code that parses brat annotations into JSON?

Thank you so much for your help in advance and your time.

Pierpaolo
