A family of LLMs called OpenELM has recently been released. They range in size from 270M to 3B parameters.

These models appear to outperform models of similar scale on various benchmarks.
They could be useful in settings where compute is limited or efficiency is a priority. The architecture uses standard transformer components for the most part, but it does include layer-wise scaling. From the paper:
Layer-wise scaling. A standard transformer layer is composed of multi-head attention (MHA) and feed-forward network (FFN). For non-uniform allocation of parameters in the transformer layer, we adjust the number of attention heads and the FFN multiplier in each transformer layer.
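For reference, here is a minimal sketch of how layer-wise scaling could assign per-layer head counts and FFN widths, assuming linear interpolation between minimum and maximum scaling factors. The names (`alpha_min`, `beta_max`, `make_divisible`, etc.) are illustrative assumptions, not OpenELM's actual configuration keys:

```python
# Illustrative sketch of layer-wise scaling: instead of a uniform configuration,
# the number of attention heads and the FFN width are interpolated between a
# minimum and a maximum factor across layers. All names here are assumptions
# made for illustration, not OpenELM's actual configuration keys.

def make_divisible(value: float, divisor: int) -> int:
    """Round `value` to the nearest positive multiple of `divisor`."""
    return max(divisor, round(value / divisor) * divisor)

def layer_wise_scaling(num_layers, model_dim, head_dim,
                       alpha_min, alpha_max, beta_min, beta_max):
    """Return a (num_heads, ffn_dim) pair for each transformer layer."""
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)                    # 0.0 at first layer, 1.0 at last
        alpha = alpha_min + t * (alpha_max - alpha_min)   # attention scaling factor
        beta = beta_min + t * (beta_max - beta_min)       # FFN multiplier
        num_heads = max(1, round(alpha * model_dim / head_dim))
        ffn_dim = make_divisible(beta * model_dim, 16)    # keep widths hardware friendly
        configs.append((num_heads, ffn_dim))
    return configs

# Example: early layers get fewer heads and narrower FFNs than later layers.
for layer, (heads, ffn) in enumerate(
        layer_wise_scaling(num_layers=4, model_dim=1024, head_dim=64,
                           alpha_min=0.5, alpha_max=1.0,
                           beta_min=2.0, beta_max=4.0)):
    print(f"layer {layer}: {heads} heads, ffn_dim {ffn}")
```

The practical consequence is that attention and FFN shapes differ from layer to layer, so a converter and runtime would need to handle per-layer dimensions rather than assume a single uniform configuration.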
It would be helpful to add support for this architecture in CTranslate2.