Duplicate tokenizer entry in config.json auto_map section #544

Open
djliden opened this issue Apr 14, 2024 · 1 comment
Labels
type/question An issue that's a question

Comments

djliden commented Apr 14, 2024

❓ The question

The Hugging Face config.json files for the OLMo models (and for the tokenizers) repeat the same entry under the auto_map AutoTokenizer key:

"AutoTokenizer": [
      "tokenization_olmo_fast.OLMoTokenizerFast",
      "tokenization_olmo_fast.OLMoTokenizerFast"
    ]

(e.g. link for olmo 7b config.json).

Couple of questions:

  1. Is this intentional?
  2. If not, is the config generated from somewhere in the GitHub codebase, or would the change need to be made on the Hugging Face side? (Happy to contribute if a change is needed)

This currently raises AttributeError: 'list' object has no attribute 'split' when loading the (PEFT) model with MLflow, which does not expect lists within auto_map. If the AutoTokenizer entry is expected to be an array, I can pursue this from the MLflow side.
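
To make the failure mode concrete, here is a minimal sketch that reproduces the same error message using only the config excerpt quoted above. It is not MLflow's actual code: `module_names_naive` and `module_names_tolerant` are hypothetical helpers, and the tolerant variant is only one possible way a consumer could accept both strings and lists in auto_map.

```python
import json

# Excerpt shaped like the config.json quoted above: AutoTokenizer maps to a list.
CONFIG_TEXT = """
{
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_olmo_fast.OLMoTokenizerFast",
      "tokenization_olmo_fast.OLMoTokenizerFast"
    ]
  }
}
"""


def module_names_naive(auto_map):
    # Hypothetical consumer that assumes every value is a "module.Class" string.
    return [ref.split(".")[0] for ref in auto_map.values()]


def module_names_tolerant(auto_map):
    # Hypothetical alternative: accept a string or a list/tuple of strings,
    # skip None entries, and de-duplicate while preserving order.
    modules = []
    for value in auto_map.values():
        refs = value if isinstance(value, (list, tuple)) else [value]
        for ref in refs:
            if ref is None:
                continue
            module = ref.rsplit(".", 1)[0]
            if module not in modules:
                modules.append(module)
    return modules


auto_map = json.loads(CONFIG_TEXT)["auto_map"]

try:
    module_names_naive(auto_map)
except AttributeError as exc:
    # Calling .split on the list value fails exactly as described above.
    naive_error = str(exc)

tolerant_result = module_names_tolerant(auto_map)
print(naive_error)
print(tolerant_result)
```

The tolerant version collapses the duplicated entry into a single module name, so the repeated line would be harmless to a consumer written this way.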

Thanks for looking!

djliden added the type/question label Apr 14, 2024

2015aroras (Contributor) commented

We recently released transformers-integrated versions of the OLMo models on HF (e.g. https://huggingface.co/allenai/OLMo-1.7-7B-hf). It may be easier to use these models, which will hopefully cause fewer issues.
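
For reference, a minimal sketch of loading one of these checkpoints through the standard transformers auto classes. It assumes a transformers release that includes OLMo support is installed; `load_olmo_hf` is a hypothetical helper name, and the import is deferred so the function can be defined without transformers present.

```python
# Model ID taken from the link above.
MODEL_ID = "allenai/OLMo-1.7-7B-hf"


def load_olmo_hf(model_id=MODEL_ID):
    """Load the tokenizer and model for a transformers-integrated OLMo checkpoint."""
    # Deferred import: assumes transformers (with OLMo support) is installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_olmo_hf() downloads the weights on first use, so it is best run on a machine with enough disk space and memory for a 7B model.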
