
# LLM Compression

## Introduction

The following introduction comes from the abstract of *Compression Represents Intelligence Linearly*:

> There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of large language models (LLMs): the development of more advanced language models is essentially enhancing compression which facilitates intelligence. ...our findings suggest that compression efficiency, as an unsupervised metric derived from raw text corpora, serves as a reliable evaluation measure that is linearly associated with the model capabilities. We open-source our compression datasets as well as our data collection pipelines to facilitate future researchers to assess compression properly.

## Official Links

- Paper: [Compression Represents Intelligence Linearly](https://arxiv.org/abs/2404.09937)
- Dataset: [hkust-nlp/llm-compression](https://huggingface.co/datasets/hkust-nlp/llm-compression)

## Overview and Usage

### Dataset

The dataset consists of three external corpora and can be downloaded with the following Python script:

```python
import os
import os.path as osp

from datasets import load_dataset

data_path = "data/llm-compression"
os.makedirs(data_path, exist_ok=True)

# Map the local file name to the subset (config) name on the Hugging Face Hub.
subset_mapping = {
    'arxiv_math': 'arxiv_math',
    'commoncraw': 'cc',
    'python': 'python',
}

for key, value in subset_mapping.items():
    llmc_dataset = load_dataset("hkust-nlp/llm-compression", name=value)
    llmc_dataset["test"].to_json(osp.join(data_path, f"{key}.jsonl"))
```

Note: Refer to the original repository for more details on data collection and design.
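
As a quick sanity check after downloading (and to preview the character counts that later form the denominator of the BPC metric), you can iterate over the saved JSONL files and count samples and characters. This is a minimal sketch rather than part of the official pipeline; it assumes each record stores its raw text in a `content` field, so check the actual dataset schema before relying on it.

```python
import json
import os.path as osp

data_path = "data/llm-compression"

# Count samples and characters per subset. The "content" field name is an
# assumption about the record schema; adjust it if the dataset differs.
for subset in ["arxiv_math", "commoncraw", "python"]:
    total_chars = 0
    num_samples = 0
    with open(osp.join(data_path, f"{subset}.jsonl"), encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total_chars += len(record["content"])
            num_samples += 1
    print(f"{subset}: {num_samples} samples, {total_chars} characters")
```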

### Inference

The inference stage (`SWCELossInferencer`) consists of the following key steps:

  1. For each candidate model, obtain the encodings of every sample in the dataset using that model's tokenizer.
  2. Concatenate the encodings of all samples into a single array and construct a PyTorch `Dataset`, where `__getitem__` returns a chunk of the array selected by a sliding window (see the sketch after this list). To reproduce the results of the original paper, set `block_size=1900` and `stride=512`.
  3. For each batch, calculate the cross-entropy loss from the model logits and targets. The losses within each batch are reduced to a single loss by summation.
  4. Output the losses and `total_chr_num` to `BPCEvaluator` for evaluation.
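
Below is a minimal sketch of the sliding-window chunking and summed-loss reduction described in steps 2 and 3. It is illustrative only: the actual logic lives in OpenCompass's `SWCELossInferencer`, and details such as batching, padding, and avoiding double-counting of tokens in overlapping windows are omitted here.

```python
import torch
from torch.utils.data import Dataset


class SlidingWindowDataset(Dataset):
    """Chunk one long 1-D tensor of token ids into overlapping blocks."""

    def __init__(self, token_ids: torch.Tensor, block_size: int = 1900, stride: int = 512):
        self.token_ids = token_ids      # concatenated encodings of all samples
        self.block_size = block_size    # window length fed to the model
        self.stride = stride            # step between consecutive window starts
        self.starts = list(range(0, max(len(token_ids) - 1, 1), stride))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        begin = self.starts[idx]
        chunk = self.token_ids[begin:begin + self.block_size]
        # Next-token prediction: inputs are chunk[:-1], targets are chunk[1:].
        return chunk[:-1], chunk[1:]


def summed_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Reduce the per-token losses of a batch to a single loss by summation."""
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="sum",
    )
```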

### Evaluation

`BPCEvaluator`: using the per-batch losses and the total number of characters in the original dataset from the inference stage, compute the bits-per-character (BPC) metric for each model:

$$ \mathrm{BPC} = \frac{\text{TotalCrossEntropyLoss}}{\text{TotalCharacterNumber} \cdot \ln 2} $$
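
Since PyTorch's cross-entropy loss is measured in nats, dividing by ln 2 converts it to bits before normalising by the character count. The following is a minimal sketch of this computation; the function name is illustrative, and the actual calculation happens inside `BPCEvaluator`.

```python
import math

def bits_per_character(total_cross_entropy_loss: float, total_chr_num: int) -> float:
    # Summed cross-entropy loss in nats, divided by ln(2) to convert to bits,
    # then normalised by the total number of characters in the raw text.
    return total_cross_entropy_loss / (total_chr_num * math.log(2))

# Example: a summed loss of 1.5e6 nats over 3.3e6 characters
# gives roughly 0.66 bits per character.
print(bits_per_character(1.5e6, 3_300_000))
```

Lower BPC indicates stronger compression of the corpus, which the paper shows correlates linearly with model capability.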

### Summarization

## Config Files

  1. Dataset config: `configs/datasets/llm-compression.py`
  2. Evaluation config: `configs/eval_llm_compression.py`

## Evaluation Results

| metric | version | model | commoncraw | python | arxiv_math | average |
|--------|---------|-----------------|-----------:|-------:|-----------:|--------:|
| bpc | af04af | qwen1.5-32b-hf | 0.5910 | 0.2584 | 0.4080 | 0.4191 |
| bpc | af04af | qwen1.5-14b-hf | 0.6459 | 0.2766 | 0.4310 | 0.4512 |
| bpc | af04af | qwen-14b-hf | 0.6197 | 0.2849 | 0.4498 | 0.4515 |
| bpc | af04af | llama-30b-hf | 0.5773 | 0.3212 | 0.4562 | 0.4516 |
| bpc | af04af | llama-2-13b-hf | 0.5807 | 0.3336 | 0.4752 | 0.4632 |
| bpc | af04af | qwen1.5-7b-hf | 0.6658 | 0.2935 | 0.4500 | 0.4698 |
| bpc | af04af | qwen-7b-hf | 0.6453 | 0.3088 | 0.4830 | 0.4790 |
| bpc | af04af | llama-13b-hf | 0.6083 | 0.3555 | 0.4865 | 0.4834 |
| bpc | af04af | llama-2-7b-hf | 0.6117 | 0.3536 | 0.4995 | 0.4883 |
| bpc | af04af | llama-7b-hf | 0.6285 | 0.3794 | 0.5096 | 0.5058 |
| bpc | af04af | qwen1.5-1.8b-hf | 0.7448 | 0.4029 | 0.5625 | 0.5701 |
| bpc | af04af | qwen-1.8b-hf | 0.7542 | 0.4175 | 0.5842 | 0.5853 |
| bpc | af04af | qwen1.5-0.5b-hf | 0.8102 | 0.4520 | 0.6181 | 0.6268 |

## FAQ

**Q: I am getting the following warning during inference. Should I truncate long samples to `max_seq_len` to avoid further errors?**

> Token indices sequence length is longer than the specified maximum sequence length for this model. Running this sequence through the model will result in indexing errors

**A:** This warning comes from the tokenizer and indicates that the tokenized sequence is longer than the model's maximum input length, but it does not affect the tokenizer's output. For the loss calculation, as long as the sliding-window `block_size` is smaller than `max_seq_len`, the warning can be safely ignored.

## Reference

```bibtex
@misc{huang2024compression,
      title={Compression Represents Intelligence Linearly},
      author={Yuzhen Huang and Jinghan Zhang and Zifei Shan and Junxian He},
      year={2024},
      eprint={2404.09937},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```