
Coarse-grained NER tokenizer problem #41

Open
NiceMartin opened this issue Aug 1, 2021 · 1 comment
Comments


NiceMartin commented Aug 1, 2021

In NER, a common source of errors is that, after tokenization, the length and token positions of the resulting sentence no longer match the original input sentence.
The code in tokenier.py does not seem to handle this mismatch between the tokenizer's input and output lengths.
For example, after reading in the coarse-grained NER corpus with
sents_src, sents_tgt = read_corpus(data_path)
sents_src[3] and sents_tgt[3] end up with different lengths after tokenization, which causes an error.

@920232796
Owner

I already pointed this out in the group chat a while ago: there is a pitfall here. Think about how to line them up; it's not hard.
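For readers hitting the same error: one common workaround for Chinese NER (a minimal sketch, not this repository's actual code; the function name `char_level_tokenize` and the toy `vocab` are hypothetical) is to tokenize one character at a time instead of letting a WordPiece-style tokenizer split or merge characters. That keeps a strict 1:1 mapping between tokens and per-character labels; the only remaining adjustment is padding the label sequence for the BERT-style `[CLS]`/`[SEP]` special tokens.

```python
# Hypothetical sketch: keep NER labels aligned with tokens by tokenizing
# character by character, so the tokenizer can never change the length.

def char_level_tokenize(sentence, labels, vocab, unk_token="[UNK]"):
    """Tokenize per character; out-of-vocab characters become [UNK].

    Returns token and label sequences of identical length, including
    the BERT-style [CLS]/[SEP] positions (labeled "O").
    """
    assert len(sentence) == len(labels), "one label per character expected"
    tokens = [ch if ch in vocab else unk_token for ch in sentence]
    aligned = list(labels)
    # Special tokens get the "outside" label so lengths still match.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    aligned = ["O"] + aligned + ["O"]
    return tokens, aligned

# Toy example: "!" is not in the vocab, so it maps to [UNK],
# but the label sequence stays the same length as the token sequence.
vocab = set("张三在北京")
toks, labs = char_level_tokenize("张三在北京!", ["B", "I", "O", "B", "I", "O"], vocab)
assert len(toks) == len(labs)
```

A full subword tokenizer can still be used at training time by expanding each character's label across its subword pieces, but for Chinese text the character-level approach above sidesteps the mismatch entirely.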
