text2vec: conversion ratio between tokens and Chinese characters #145

Open

cutelitchi opened this issue Dec 26, 2023 · 1 comment

Labels

question

cutelitchi commented Dec 26, 2023

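As I understand it, max_seq_length is the maximum number of tokens the model can process. Roughly what is the conversion ratio between tokens and Chinese characters in this model? I read on a blog that for text2vec 1 token is approximately 1.5 Chinese characters. Is that correct?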

cutelitchi added the question label

Owner

shibing624 commented Dec 26, 2023

It uses BERT's token encoding: 1 token is 1 Chinese character.
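To illustrate the 1:1 ratio, here is a rough, dependency-free sketch of how a BERT-style tokenizer treats Chinese text: each CJK ideograph is isolated as its own token, and punctuation is split off as well. The helper names (`is_cjk`, `rough_tokens`, `estimate_length`) are made up for this example and are not part of text2vec or transformers. The sketch also assumes that non-Chinese words stay whole, whereas real WordPiece may split them into several pieces, and that the `[CLS]` and `[SEP]` special tokens count toward max_seq_length. Treat the result as an estimate; for exact counts, run the model's own tokenizer.

```python
import unicodedata
from typing import List


def is_cjk(ch: str) -> bool:
    """Return True if ch lies in a CJK ideograph block (approximating BERT's check)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF
        or 0x3400 <= cp <= 0x4DBF
        or 0x20000 <= cp <= 0x2A6DF
        or 0xF900 <= cp <= 0xFAFF
        or 0x2F800 <= cp <= 0x2FA1F
    )


def rough_tokens(text: str) -> List[str]:
    """Approximate BERT pre-tokenization: one token per Chinese character.

    Assumption: non-CJK words are kept whole; real WordPiece may split them
    further, so this is a lower bound for mixed Chinese/Latin text.
    """
    tokens: List[str] = []
    word = ""
    for ch in text:
        # CJK ideographs and punctuation each become a standalone token.
        standalone = is_cjk(ch) or unicodedata.category(ch).startswith("P")
        if standalone or ch.isspace():
            if word:
                tokens.append(word)
                word = ""
            if standalone:
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens


def estimate_length(text: str) -> int:
    """Estimated sequence length, including the [CLS] and [SEP] special tokens."""
    return len(rough_tokens(text)) + 2


print(rough_tokens("我爱北京"))
print(estimate_length("我爱北京"))
```

Under these assumptions, a pure-Chinese sentence of n characters takes about n + 2 positions, so the usable budget is roughly max_seq_length minus 2 characters.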
