How are English sentences in the ground truth tokenized when computing embedding average? #86

Open
allyouneeds opened this issue Feb 8, 2022 · 2 comments

Comments

@allyouneeds

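Following the guidance in issues #53 and #55, I am trying to reproduce the embedding average computation, and I found that some ground truth responses are English sentences. Chinese word segmentation does not work well on English text. How did you handle this? Did you discard these sentences, or did you segment the English ground truth with an English tokenizer? For example, STC_test.json contains "I f o n l y w e c o u l d s e e t h e w o r l d i n t h e e y e s o f a b a b y". How is a case like this handled?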
image

@silverriver
Collaborator

I believe the implementation treats each English letter as a single token.
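For readers trying to reproduce this, here is a minimal sketch of what per-character tokenization followed by embedding average could look like. It is not the repository's code: the function names (`char_tokenize`, `embedding_average`, `cosine`) and the `embeddings` lookup table are hypothetical, and it assumes whitespace only separates tokens and that out-of-vocabulary tokens are skipped.

```python
import numpy as np

def char_tokenize(text):
    # Assumption: every non-whitespace character (Chinese character or
    # English letter) is one token; whitespace is only a separator.
    return [ch for ch in text if not ch.isspace()]

def embedding_average(tokens, embeddings, dim):
    # `embeddings` is a hypothetical dict mapping token -> vector of length `dim`.
    # Tokens missing from the table are skipped (an assumption, not confirmed).
    vecs = [np.asarray(embeddings[t], dtype=float) for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine(a, b):
    # Cosine similarity between two sentence vectors; 0.0 if either is all zeros.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```

With this sketch, a ground truth such as "I f o n l y" and a Chinese sentence go through the same path: both are split into single characters, each character is looked up in the embedding table, and the resulting vectors are averaged before the cosine similarity is taken.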

@allyouneeds
Author

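Thanks for the reply. Do Chinese and English use the same tokenization scheme? That is, is each letter one token for English and each character one token for Chinese? For example, which of the two approaches in the image below is used?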
image
