HF's llama implementation is different from meta-llama's #30872

Closed
1 of 4 tasks
tianyu-l opened this issue May 16, 2024 · 7 comments

Comments

@tianyu-l

System Info

For example, the rotary embedding implementations are quite different.

I'm wondering:

  1. Which implementation were the HF model weights (e.g. in "meta-llama/Meta-Llama-3-8B") trained with?
  2. Which implementation are we running inference on top of?

Here are the code pointers:
HF:

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):

meta-llama: https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a9596f1dea06d9c564f9c0e35f/llama/model.py#L65

Who can help?

@ArthurZucker @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

Expected behavior

A consistent implementation of the Llama models across HF transformers and meta-llama.

@ArthurZucker
Collaborator

The Llama 2 code is subject to a code license; no one is allowed to use the original code, which is why we have a different version.
As long as the outputs produced are the same, this should not matter, no?

  • TGI uses a similar computation of the cos and sin
  • pretty sure vLLM does as well
  • transformers has more than 15 models that use this formulation of RoPE; not a single one uses the formulation in the meta-llama repo

See this: https://github.com/meta-llama/llama/blob/main/LICENSE

@rlrs

rlrs commented May 17, 2024

@ArthurZucker I can expand on this a bit, because there is an actual difference in the implementations. The difference is whether you treat sequential entries of the embeddings as (real, imaginary) pairs, or treat the first half as real and the second half as imaginary. I think this has to mean that weights trained with one version must be incompatible with the other version?

This has come up over in torchtitan (as linked in the OP of this issue), because I've been trying to load the HF Llama weights and use them with the torchtitan codebase. However, these weights only work if I use the first-half/second-half implementation, i.e. the one used in HF transformers.
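
To make the difference concrete, here is a rough sketch of the two pairing conventions (toy shapes and helper names are mine, not the actual transformers or meta-llama code):

import torch

def rope_interleaved(x, freqs):
    # Meta-style: adjacent entries (x[..., 2i], x[..., 2i+1]) form one complex pair.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = x_complex * torch.polar(torch.ones_like(freqs), freqs)
    return torch.view_as_real(rotated).flatten(-2).type_as(x)

def rope_half_split(x, freqs):
    # HF-style rotate_half: entry i is paired with entry i + head_dim // 2.
    cos = torch.cat([freqs.cos(), freqs.cos()], dim=-1)
    sin = torch.cat([freqs.sin(), freqs.sin()], dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin

head_dim, seq_len = 8, 4
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim // 2)
x = torch.randn(seq_len, head_dim)

# Same input, same frequencies, different pairing convention -> different numbers.
print(torch.allclose(rope_interleaved(x, freqs), rope_half_split(x, freqs)))  # False

On the same tensor the two conventions produce different results, hence my suspicion that the weights can't be interchangeable as-is.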

@tianyu-l
Author

@ArthurZucker

As long as the outputs produced are the same, this should not matter, no?

Being able to load the weights doesn't mean the implementations are the same. The difference in the implementations will cause the outputs to differ, given the same input.

I'm a bit confused here... Please correct me if I'm wrong.
Assuming the weights are from Meta and were trained using Meta's implementation:
then the consequence is that the weights are optimized for Meta's implementation, although still loadable into HF's Llama. So the inference results should be worse on HF's model, and if people start fine-tuning from HF's Llama, they should see a much higher loss (at least initially).

cc: @rlrs

@rlrs

rlrs commented May 18, 2024

Okay, so this has been cleared up, I think. The HF weights are indeed different:

# permute for sliced rotary
def permute(w, n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

Glad to find out that it is a simple weight permutation. This should probably be clarified somewhere.
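
For anyone else hitting this, here is a rough illustration of what that permutation does to the row order inside each head (toy sizes chosen for readability; these are not the real model dimensions):

import torch

dim, n_heads = 8, 1  # toy sizes: a single head of head_dim 8, small enough to read by eye

def permute(w, n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

w = torch.arange(dim, dtype=torch.float32).unsqueeze(-1).repeat(1, dim)  # row i is filled with the value i
print(permute(w, n_heads)[:, 0])
# tensor([0., 2., 4., 6., 1., 3., 5., 7.])

So the interleaved (real, imaginary) row pairs of the Meta checkpoint are regrouped into the first-half/second-half layout that the rotate_half formulation expects; the parameters themselves are unchanged, only their order.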

@ArthurZucker
Collaborator

ArthurZucker commented May 20, 2024

@rlrs I probably should have specified that yes, this works because we permute the weights. But if you don't permute them and still use the cos/sin formulation, you have a different implementation: you end up doing the permutation of q and k on the fly, which costs a lot (see the sketch below).

Pretty sure it is clarified that you need a conversion script to go from the original model to transformers, but if you want to add this to the readme, feel free to do so.
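
Here is a rough sketch of that trade-off, i.e. permuting the projection weight once versus permuting q on the fly at every forward pass (toy sizes and made-up names, not the transformers code):

import torch

dim, n_heads = 16, 2
head_dim = dim // n_heads

def permute(w, n_heads, dim1=dim, dim2=dim):
    # same permute as in the conversion script snippet above
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

def interleave_to_half(q):
    # on-the-fly version: regroup (r0, i0, r1, i1, ...) into (r0, r1, ..., i0, i1, ...)
    # inside each head, at every forward pass
    batch = q.shape[:-1]
    q = q.view(*batch, n_heads, head_dim // 2, 2).transpose(-1, -2)
    return q.reshape(*batch, dim)

x = torch.randn(3, dim)     # toy hidden states
wq = torch.randn(dim, dim)  # q_proj weight in the Meta checkpoint layout

q_on_the_fly = interleave_to_half(x @ wq.T)  # reorder the activations every forward
q_baked_in = x @ permute(wq, n_heads).T      # reorder the weight once at conversion time
print(torch.allclose(q_on_the_fly, q_baked_in))  # True

Both give the same q; the conversion script just pays the reordering cost once instead of at every step.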

@tianyu-l I don't understand the confusion: if the implementation is different but the produced outputs, meaning the generations, are correct, where is the problem?
Imagine a genius finds a new way to compute matrix-vector products and we use it (cf. FlashAttention) even though Meta did not use it to train their model. Is that bad, even if the outputs, the logits, and the generations are the same and the generation speed is around 2x faster?
If you use torch.compile, or if you use gpt-fast, you are again not using the same implementation, but the results do match. How is that possible?

So the inference results should be worse on HF's model, and if people start fine-tuning from HF's Llama, they should see a much higher loss (at least initially).

This is just completely wrong and unsourced. When we add a model to transformers, we make sure that we reproduce the results. For the Llama family, the logits match to within 1e-2 (because of parallel training, plus maybe a bit of RoPE), but the generated outputs match 1-to-1 (generate 1,000,000 tokens greedily and you will get the same results).

@tianyu-l
Author

Thanks for the clarification! When I said different implementations, I meant implementations that lead to different results on the same input. (I thought this was clear from the context, but it seems not; my bad.)

I agree that as long as the outputs match, the implementation details don't matter much.

@ArthurZucker
Collaborator

cool, then we are on the same page! 🤗
