
How many llama models are used for constructing llama-moe? #55

Open
ZeyuTeng96 opened this issue Jan 17, 2024 · 7 comments

Comments

@ZeyuTeng96 commented Jan 17, 2024

  1. How many llama models are used when constructing LLaMA-MoE?
  2. Does this repo partition one llama model's FFN layers (via different splitting methods) into multiple FFNs that act as experts, and then combine the remaining layers and weights with the split FFNs and a gate to form an MoE model?
  3. Do you support merging the FFN layers of multiple llama-architecture models to build an MoE on top of a single base llama structure?
@Spico197 changed the title to "How many llama models are used for constructing llama-moe?" Jan 17, 2024
@Spico197 (Collaborator)

Hi there, thanks for your attention to this project ❤️

  1. LLaMA-MoE is constructed from ONE llama2-7B model.
  2. Yes, you are right. We only partition llama2's FFN layers into multiple experts, then initialize a gate on top for token routing (a rough sketch follows below).
  3. Currently this repo does not support that. But since all the candidate models would share the same architecture, I think it would not be difficult to implement.
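For concreteness, a minimal sketch of the idea behind point 2 might look like the snippet below. This is not the repo's actual code: the module names, the contiguous split, and the expert count are assumptions made for illustration, and LLaMA-MoE itself explores several neuron-partition strategies.

```python
import torch.nn as nn

def split_ffn_into_experts(gate_proj, up_proj, down_proj, num_experts):
    """Slice one LLaMA SwiGLU FFN along its intermediate dimension into
    `num_experts` smaller FFNs (contiguous split shown only for illustration)."""
    intermediate = gate_proj.weight.shape[0]
    assert intermediate % num_experts == 0
    chunk = intermediate // num_experts
    experts = []
    for i in range(num_experts):
        rows = slice(i * chunk, (i + 1) * chunk)
        experts.append({
            "gate_proj": gate_proj.weight[rows, :].detach().clone(),  # [chunk, hidden]
            "up_proj":   up_proj.weight[rows, :].detach().clone(),    # [chunk, hidden]
            "down_proj": down_proj.weight[:, rows].detach().clone(),  # [hidden, chunk]
        })
    return experts

hidden, intermediate, n_experts = 4096, 11008, 8
gate_proj = nn.Linear(hidden, intermediate, bias=False)
up_proj = nn.Linear(hidden, intermediate, bias=False)
down_proj = nn.Linear(intermediate, hidden, bias=False)

experts = split_ffn_into_experts(gate_proj, up_proj, down_proj, n_experts)
# The router (the "gate" in the reply above) is the only newly initialized part:
# a linear map from hidden states to expert logits, used for top-k token routing.
router = nn.Linear(hidden, n_experts, bias=False)
```

All attention layers, embeddings, and norms stay untouched; only the FFN weights are redistributed across experts, and the router is trained from scratch.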

@ZeyuTeng96 (Author)

Thanks a lot! Have you considered setting up a WeChat group later on so everyone can discuss MoE together?

@Sniper970119 commented Jan 25, 2024

> (quoting @Spico197's reply above)

Hello, I have some other questions.

Regarding the gate in item 2: we ran some experiments and found that its gradients are often very large. We are not sure whether this comes from the gate initialization (the model parameters are well trained, but the gate is randomly initialized, so we wonder whether that causes the abnormal gradients). Could you share the initialization strategies you used for the gates? Also, does the early warmup phase need tricks such as layer-wise learning rates to speed up the gate updates, or freezing the other layers?
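As one concrete reading of the "layer-wise learning rate" idea above, the randomly initialized routers could be put into their own optimizer group with a larger learning rate. This is only a sketch of the question being asked, not LLaMA-MoE's actual recipe (see the reply below); the "router" substring is a hypothetical name filter.

```python
import torch

def build_optimizer(model, base_lr=2e-4, router_lr=2e-3):
    """Give the freshly initialized router/gate weights their own (larger)
    learning rate while the well-trained backbone keeps a smaller one.
    Adapt the "router" name filter to however the gates are actually named."""
    router_params, backbone_params = [], []
    for name, param in model.named_parameters():
        (router_params if "router" in name else backbone_params).append(param)
    return torch.optim.AdamW([
        {"params": router_params, "lr": router_lr},
        {"params": backbone_params, "lr": base_lr},
    ])
```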

@Spico197 (Collaborator)

We tested first freezing the other parameters and pre-training only the gates. However, as more tokens were consumed during continual pre-training, the two-stage pre-training did not show any advantage. So we kept things simple and trained the whole model without any special gating tricks.
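As a rough illustration of that two-stage setup (not the repo's actual training code; the "router" name filter is again only a placeholder for however the gates are named):

```python
def set_router_only_training(model, router_only: bool) -> None:
    """Stage 1: train only the router/gate weights. Stage 2: unfreeze everything."""
    for name, param in model.named_parameters():
        param.requires_grad = ("router" in name) if router_only else True

# set_router_only_training(model, router_only=True)   # stage 1: gate-only pre-training
# ...continue pre-training for some tokens...
# set_router_only_training(model, router_only=False)  # stage 2: full continual pre-training
```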

@Sniper970119

> (quoting @Spico197's reply above)

Roughly how many tokens does it take for the two approaches to become essentially equivalent? And if the gates are not handled with special care, at roughly what token count does the loss drop to a reasonable level? My loss is currently around 4.x, and the gradient norm is in the several-thousand range and still rising. From past experience, gradients that large look wrong.

@Spico197 (Collaborator)

Hi there~ For the multi-stage pre-training comparison, it took about 20B tokens. It may take about 20-30B tokens to reach a relatively low loss value (2.1). But 20B tokens of gate pre-training may not be an efficient training recipe (the loss converges within 5-10B tokens), so you could try different settings to find a better one.

@Sniper970119 commented Jan 25, 2024

Thank you very much for the answers. May I ask a few more questions?

  • Did you track the gradients during training? Roughly how did they evolve? (For concreteness, a small norm-logging sketch is appended at the end of this comment.)
  • Roughly what loss do you see right after initialization (step 1)? Without any special handling, does plain training reach a loss of about 2.1 at around 20B tokens (with a global batch of about 15M tokens)?
  • If training at the 1e-4-scale learning rate from the tech report, why doesn't a full update with the randomly initialized gate significantly damage the pre-trained parameters? (Given that the learning rate reaches that scale after only 100 warmup steps, it seems like it would heavily disturb the previously learned weights.)

On my side, the gradients right after initialization look quite problematic, although I have not observed any issue in the loss in the short term.

(screenshot attached)

Looking forward to your reply~
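The norm-logging sketch mentioned in the first question: a minimal way to log the kind of per-step gradient norms discussed in this thread (plain PyTorch, unrelated to the repo's own logging; the "router" name filter is again only a placeholder for the gate weights).

```python
def log_grad_norms(model, step):
    """Call after loss.backward() and before optimizer.step()/zero_grad()."""
    total_sq, router_sq = 0.0, 0.0
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        sq = param.grad.detach().float().pow(2).sum().item()
        total_sq += sq
        if "router" in name:  # placeholder filter for the gate weights
            router_sq += sq
    print(f"step {step}: global grad norm {total_sq ** 0.5:.2f}, "
          f"router grad norm {router_sq ** 0.5:.2f}")
```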
