
Coarse-grained NER tokenizer problem #41

Open
NiceMartin opened this issue Aug 1, 2021 · 1 comment
Comments


NiceMartin commented Aug 1, 2021

In NER, a common source of errors is that, after tokenization, the length and token positions of the resulting sentence no longer match the original input sentence.
The code in tokenier.py does not seem to handle this mismatch between the tokenizer's input and output lengths.
For example, after reading in the coarse-grained NER corpus with
sents_src, sents_tgt = read_corpus(data_path)
sents_src[3] and sents_tgt[3] end up with different lengths after tokenization, which causes an error.

@920232796
Owner

I already pointed this out in the group chat a while ago: there is a pitfall here. Think about how to line them up; it's not hard.
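For readers hitting the same error: one common workaround for Chinese NER (a minimal sketch, not this repository's actual code; the function name `char_level_tokenize` and the toy `vocab` are hypothetical) is to tokenize one character at a time instead of letting a WordPiece-style tokenizer split or merge characters. That keeps a strict 1:1 mapping between tokens and per-character labels; the only remaining adjustment is padding the label sequence for the BERT-style `[CLS]`/`[SEP]` special tokens.

```python
# Hypothetical sketch: keep NER labels aligned with tokens by tokenizing
# character by character, so the tokenizer can never change the length.

def char_level_tokenize(sentence, labels, vocab, unk_token="[UNK]"):
    """Tokenize per character; out-of-vocab characters become [UNK].

    Returns token and label sequences of identical length, including
    the BERT-style [CLS]/[SEP] positions (labeled "O").
    """
    assert len(sentence) == len(labels), "one label per character expected"
    tokens = [ch if ch in vocab else unk_token for ch in sentence]
    aligned = list(labels)
    # Special tokens get the "outside" label so lengths still match.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    aligned = ["O"] + aligned + ["O"]
    return tokens, aligned

# Toy example: "!" is not in the vocab, so it maps to [UNK],
# but the label sequence stays the same length as the token sequence.
vocab = set("张三在北京")
toks, labs = char_level_tokenize("张三在北京!", ["B", "I", "O", "B", "I", "O"], vocab)
assert len(toks) == len(labs)
```

A full subword tokenizer can still be used at training time by expanding each character's label across its subword pieces, but for Chinese text the character-level approach above sidesteps the mismatch entirely.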
