Hi, I have two questions I would like to pose to the authors:
I observed that the model sizes are limited to the smaller scale (1–3B parameters). Is there a specific reason for this choice when deciding which experts to train? What challenges would you foresee in scaling up to the mid range (13B–30B)?
The ablation studies highlight that limited instruction-tuned multimodal data hurts model sparsification during MoE training. Could you elaborate on why this is the case, and perhaps share some insight into how much data would reasonably be required to achieve such sparsification?
Thanks very much for the great work.