llama tools #165

Open
ziwang-com opened this issue Jun 28, 2023 · 0 comments
Comments

@ziwang-com
Owner

https://github.com/Ronsor/llama-tools
llama-tools
Random tools for playing with the LLaMA LLM and its tokenizer.

add_tokens.py
A simple script to add tokens from a text file to a tokenizer. You may still need to fine-tune the model so that it understands the new tokens. Requires protobuf to be installed, but not sentencepiece (although you may still want it).

Usage: python add_tokens.py [original model] [output model] [token list]
[original model] is the path to the original tokenizer model; etc/tokenizer.model is included for convenience.
[output model] is the file path for the modified tokenizer model; it should not be the same as [original model].
[token list] is the name of a text file with the following format:
N normal token
C
U user defined token
UB YW5vdGhlciB1c2VyIHRva2Vu
A line starts with the token type, followed by a space and then the token value (up to the newline), or followed by B and a space to indicate that the token value is base64-encoded. See test_list.txt.

For information on token types, see sputil/sentencepiece_model.proto and https://github.com/google/sentencepiece.
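
As a rough illustration of what such a script involves, here is a minimal sketch of adding tokens to a sentencepiece model through its protobuf definition. It assumes a module compiled from sentencepiece_model.proto, imported here as sentencepiece_model_pb2; the function and variable names are illustrative, not the actual internals of add_tokens.py.

```python
# Hedged sketch: append tokens from a token-list file to a sentencepiece
# model, using a module compiled from sentencepiece_model.proto (assumed).
import base64
import sys

import sentencepiece_model_pb2 as sp_pb2  # assumed generated module

# Map the single-letter token types from the list format to proto enum values.
TOKEN_TYPES = {
    "N": sp_pb2.ModelProto.SentencePiece.NORMAL,
    "C": sp_pb2.ModelProto.SentencePiece.CONTROL,
    "U": sp_pb2.ModelProto.SentencePiece.USER_DEFINED,
}

def add_tokens(original_model, output_model, token_list):
    model = sp_pb2.ModelProto()
    with open(original_model, "rb") as f:
        model.ParseFromString(f.read())

    with open(token_list, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            kind, _, value = line.partition(" ")
            if kind.endswith("B"):  # trailing B: the value is base64-encoded
                kind = kind[:-1]
                value = base64.b64decode(value).decode("utf-8")
            piece = model.pieces.add()  # append a new sentencepiece entry
            piece.piece = value
            piece.score = 0.0
            piece.type = TOKEN_TYPES[kind]

    with open(output_model, "wb") as f:
        f.write(model.SerializeToString())

if __name__ == "__main__":
    add_tokens(sys.argv[1], sys.argv[2], sys.argv[3])
```

Invoked the same way as the real script, e.g. python add_tokens.py tokenizer.model new.model test_list.txt.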

tokenizer_info.py
A simple script to print a tokenizer's training configuration.
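
Printing that configuration amounts to little more than reading the model protobuf and dumping its trainer_spec and normalizer_spec fields (those field names come from sentencepiece_model.proto). A minimal sketch, under the same assumed sentencepiece_model_pb2 module:

```python
# Hedged sketch: dump a tokenizer's training configuration.
import sys

import sentencepiece_model_pb2 as sp_pb2  # assumed generated module

model = sp_pb2.ModelProto()
with open(sys.argv[1], "rb") as f:
    model.ParseFromString(f.read())

print(model.trainer_spec)     # vocab size, model type, training options
print(model.normalizer_spec)  # normalization rules applied at training time
```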

merge_tokenizer.py
A script to merge tokenizer model B into tokenizer model A. This may be useful for fine-tuning.
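
One plausible way to do such a merge, sketched under the same assumed protobuf module, is to append every piece of model B that model A lacks; the real script's conflict handling may differ.

```python
# Hedged sketch: merge the pieces of tokenizer model B into model A.
import sys

import sentencepiece_model_pb2 as sp_pb2  # assumed generated module

def load(path):
    m = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        m.ParseFromString(f.read())
    return m

model_a, model_b = load(sys.argv[1]), load(sys.argv[2])
existing = {p.piece for p in model_a.pieces}
for p in model_b.pieces:
    if p.piece not in existing:        # keep A's entry on conflict
        model_a.pieces.add().CopyFrom(p)

with open(sys.argv[3], "wb") as f:     # write the merged model
    f.write(model_a.SerializeToString())
```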
