No Chinese visible in the vocabulary #323
Comments
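That's not how you inspect it.
My text editor is already set to UTF-8 and I still can't see any Chinese. How can I make it visible?
I suggest reading up on how the llama3 tokenizer works. You probably can't read Chinese directly out of the vocabulary, because the Chinese text has all been broken apart into smaller pieces.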
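To illustrate the point about Chinese being "broken apart", here is a minimal sketch that turns one vocabulary entry back into readable text. It assumes that tokenizer.json stores tokens in the GPT-2 style byte-level encoding, where every UTF-8 byte is written as one stand-in printable character; that assumption is not confirmed in this thread. The class name ByteLevelTokenDecoder and the method decodeToken are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ByteLevelTokenDecoder {
    // Inverse of the GPT-2 style byte-to-unicode table:
    // stand-in character -> original byte value (0..255).
    private static final Map<Character, Integer> CHAR_TO_BYTE = buildCharToByte();

    private static Map<Character, Integer> buildCharToByte() {
        Map<Character, Integer> map = new HashMap<>();
        int next = 0; // counts the bytes that have no printable form of their own
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= 33 && b <= 126)
                    || (b >= 161 && b <= 172)
                    || (b >= 174 && b <= 255);
            // Printable bytes map to themselves; the rest map to U+0100, U+0101, ...
            char c = printable ? (char) b : (char) (256 + next++);
            map.put(c, b);
        }
        return map;
    }

    // Converts a vocabulary entry back to raw bytes and decodes them as UTF-8.
    static String decodeToken(String token) {
        byte[] bytes = new byte[token.length()];
        for (int i = 0; i < token.length(); i++) {
            Integer b = CHAR_TO_BYTE.get(token.charAt(i));
            if (b == null) {
                throw new IllegalArgumentException(
                        "Not a byte-level character: " + token.charAt(i));
            }
            bytes[i] = (byte) b.intValue();
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in characters for the bytes E4 BD A0 E5 A5 BD, the UTF-8 form of 你好.
        String token = "\u00e4\u00bd\u0142\u00e5\u00a5\u00bd";
        System.out.println(decodeToken(token));
    }
}
```

Under that assumption, a vocabulary entry may hold only part of a Chinese character's bytes. Such an entry decodes to the replacement character U+FFFD rather than to readable text, which would explain why no editor encoding setting makes those entries look like Chinese.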
public static void utf8ToGbk() throws Exception {
    String fileName = "c:/tokenizer.json";
    List<String> lines = Files.readAllLines(Paths.get(fileName), Charset.forName("utf-8"));
    String sentence = null;
    int size = lines.size();
    for (int i = 0; i < size; i++) {
        sentence = lines.get(i);
        //System.out.println(sentence);
        System.out.println(new String(sentence.getBytes("GBK")));
    }
}
This doesn't show any Chinese either. What do I need to do to see the Chinese tokens in the vocabulary?