Converting OPT-175B tokenizer to HF format? #704

Open
mawilson1234 opened this issue Apr 7, 2023 · 2 comments
Labels: question (Further information is requested)

Comments


❓ Questions and Help

What is your question?

I've downloaded the weights for OPT-175B using the URL I got after filling out the Google form. I've also got dict.txt, gpt2-merges.txt, and gpt2-vocab.json. My existing workflow uses the Hugging Face API, so I've converted the weights to HF format using the script here.

However, I'm not sure how to convert the tokenizer to HF format from these files. I see there is a way to build a tokenizer from the gpt2-merges.txt and gpt2-vocab.json files, but that would leave dict.txt unused, which strikes me as likely to cause issues (I can't imagine it would be included if it weren't needed). Is there a recommended way to do this conversion?

As an alternative, the smaller OPT models and their tokenizers are available on the HF Hub, so I can just get them from there. Do all the OPT models, including 175B, use the same tokenizer?

If it doesn't make a difference, I could just use the tokenizer from HF for one of the smaller models instead. I could easily verify for myself whether the smaller models have identical tokenizers by comparing the HF tokenizers for the different sizes (see the sketch below), but that wouldn't necessarily tell me whether 175B uses the same one, since 175B isn't on the Hub as such.
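
For example, something like this would compare two of the Hub checkpoints (just a rough sketch; facebook/opt-125m and facebook/opt-1.3b are simply two of the published sizes):

    from transformers import AutoTokenizer

    # Two OPT sizes that are published on the HF Hub (examples only).
    tok_a = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
    tok_b = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=False)

    # Same vocab and same tokenization of a sample string would suggest
    # the published sizes share a single tokenizer.
    sample = "Converting the OPT-175B tokenizer to HF format"
    print(tok_a.get_vocab() == tok_b.get_vocab())
    print(tok_a(sample)["input_ids"] == tok_b(sample)["input_ids"])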

@mawilson1234 mawilson1234 added the question Further information is requested label Apr 7, 2023
@mawilson1234 mawilson1234 changed the title Converting tokenizer to HF format? Converting OPT-175B tokenizer to HF format? Apr 7, 2023
mawilson1234 (Author) commented Apr 7, 2023

After some testing, it appears that the tokenizers on HF are probably the same as the one for OPT-175B (at the very least, my output for a short test made sense when decoded with the tokenizer available on HF for facebook/opt-125m). But it'd still be nice to be sure, just in case.

ayeeyecorp commented Apr 13, 2023

@mawilson1234 I believe you are correct. I used "tokenizer_config.json" and "special_tokens_map.json" from the HF OPT model repo.

Tips (OPT HF link):

- OPT has the same architecture as BartDecoder.
- Contrary to GPT2, OPT adds the EOS token </s> to the beginning of every prompt. Note: Make sure to pass use_fast=False when loading OPT’s tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/v4.19.2/en/model_doc/auto#transformers.AutoTokenizer) to get the correct tokenizer. A quick check is sketched below.
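
For example, loading the slow tokenizer for one of the Hub checkpoints and checking the prepended </s> (a sketch using facebook/opt-125m, since 175B itself isn't on the Hub):

    from transformers import AutoTokenizer

    # Slow (non-fast) tokenizer, as the OPT docs recommend.
    tok = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)

    ids = tok("Hello world")["input_ids"]
    # Unlike GPT-2, OPT prepends </s> to every prompt,
    # so the first token should decode to "</s>".
    print(tok.convert_ids_to_tokens(ids[0]))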

You can try generating the tokenizer with:

    import os
    from transformers import GPT2Tokenizer

    # Build the tokenizer from the vocab/merges files and save it in HF format.
    vocab_file = os.path.join(model_path, "gpt2-vocab.json")
    merges_file = os.path.join(model_path, "gpt2-merges.txt")
    tokenizer = GPT2Tokenizer(vocab_file, merges_file)
    tokenizer.save_pretrained(model_path)
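
Once saved, the directory should load back with AutoTokenizer.from_pretrained(model_path, use_fast=False); comparing its output on a few strings against the facebook/opt-125m tokenizer from the Hub is a quick sanity check (assuming, as above, that the Hub tokenizer matches the 175B one).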
