Preparing training data for a domain with many multi-keyword tokens #240

sathiyabalu89 · 2020-12-08T10:01:30Z

How do I prepare the training data if I have many multi-word tokens in a domain like chemistry? For example:

1. Original Sentences: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n This is another sentence."

Here "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide" is a single token. It contains multiple whitespace-separated words, so whitespace tokenization would split it into 3 tokens: ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide'].
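The split described above can be reproduced with a plain whitespace tokenizer. This is only a minimal sketch that assumes tokens are produced by `str.split()`; the actual training pipeline may tokenize differently. Note that the trailing period stays attached to the last word.

```python
# Minimal sketch: plain whitespace tokenization of the example sentence.
# Assumption: tokens are produced by str.split(); the real pipeline may differ.
sentence = (
    "This is a multi word chemical component "
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide."
)

tokens = sentence.split()

# The chemical name ends up spread over the last three tokens.
print(tokens[-3:])
# ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide.']
```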

How can I avoid this? Can I give the input training data in the following format to avoid this?

Training data (1): A list of tokens for each sentence, so the training text file would contain a list of token lists.

[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]
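One way such token lists could be produced is to match known multi-word terms before falling back to whitespace splitting. The sketch below assumes a list of domain terms is available in advance; `tokenize_with_terms` and `CHEMICAL_TERMS` are hypothetical names for illustration, not part of any library. Unlike the list above, punctuation that directly follows a term becomes its own token here.

```python
import re

# Hypothetical term list; in practice this would come from a domain dictionary.
CHEMICAL_TERMS = [
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide",
]

def tokenize_with_terms(sentence, terms):
    """Split on whitespace, but keep known multi-word terms as single tokens."""
    # Try longer terms first so a long term is not shadowed by a shorter one.
    alternatives = [re.escape(t) for t in sorted(terms, key=len, reverse=True)]
    # Fallback: any run of non-whitespace characters.
    alternatives.append(r"\S+")
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(sentence)

text = (
    "This is a multi word chemical component "
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.\n"
    "This is another sentence."
)

# One token list per input line (one sentence per line is assumed).
token_lists = [tokenize_with_terms(line, CHEMICAL_TERMS) for line in text.splitlines()]
print(token_lists)
```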

Training data (2): Here I have joined the words of each multi-keyword token with the '|' symbol.
"This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide. \n This is another sentence." I would then tweak the ELMo code to handle the '|' symbol and keep each joined sequence as a single token.
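The '|' joining step could be sketched as a preprocessing pass over the raw text. This assumes the multi-word terms are known in advance and that '|' never occurs in the corpus otherwise; `join_terms` and `split_and_restore` are hypothetical helpers, and nothing here touches the ELMo code itself.

```python
def join_terms(text, terms, sep="|"):
    """Replace the spaces inside each known term with `sep`."""
    # Longer terms first, so a shorter term cannot break up a longer one.
    for term in sorted(terms, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", sep))
    return text

def split_and_restore(text, sep="|"):
    """Whitespace-tokenize, then turn `sep` back into spaces inside each token."""
    return [token.replace(sep, " ") for token in text.split()]

terms = ["3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"]
raw = (
    "This is a multi word chemical component "
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide."
)

joined = join_terms(raw, terms)
tokens = split_and_restore(joined)
print(joined)
print(tokens[-1])
```

Whether the separator is restored or left in the token is a design choice; leaving it in keeps the training file purely whitespace-delimited.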

Please advise on the best way to prepare the training data.

gohjiayi · 2021-09-13T09:08:37Z

Although late, I have posted an answer to the same question on Stack Overflow here. Hope it helps future developers.
