
Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation

This repository accompanies the paper Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation. If you have any questions or suggestions, feel free to open an issue; we'd be happy to assist!

Updates

  • 🔥🔥🔥 2025/02/27: Released the codebase of Mixup Model Merge (M3) on GitHub.
  • 🔥🔥🔥 2025/02/26: Released the first version of our 🚀paper🚀.

Introduction

In this study, we propose Mixup Model Merge (M3), an innovative approach inspired by the Mixup data augmentation technique. This method merges the parameters of two large language models (LLMs) by randomly generating linear interpolation ratios, allowing for a more flexible and comprehensive exploration of the parameter space. The implementation of M3 is similar to the process of proportionally mixing magical potions in *Harry Potter*.
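The core operation of M3 can be sketched as follows. This is a minimal, conceptual example rather than the repository implementation (see merge_llms_instruct_math_code.py); in particular, sampling the interpolation ratio from a Beta(alpha, alpha) distribution mirrors standard Mixup and is an assumption here.

import torch

def mixup_merge(state_dict_a, state_dict_b, alpha=0.5):
    # Hypothetical sketch: draw one interpolation ratio lam from Beta(alpha, alpha)
    # and linearly mix every parameter of the two models with lam and (1 - lam).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    merged = {name: lam * param + (1.0 - lam) * state_dict_b[name]
              for name, param in state_dict_a.items()}
    return merged, lam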

Extensive experimental results show that:

  1. M3 significantly improves the performance of merged models across various tasks (as shown in (a));
  2. M3 enhances the out-of-distribution (OOD) and adversarial robustness of the merged models (as shown in (b));
  3. M3 boosts model merging when combined with sparsification techniques like DARE (as shown in (c)).

(a) Performance of the merged models obtained through Task Arithmetic and TIES-Merging with/without M3.


(b) (Left) Performance of merged models with/without M3 on OOD datasets. (Right) Adversarial robustness, measured by performance drop rate (PDR), of merged models with/without M3.


(c) The performance trend of the merged models obtained through TIES-Merging without M3 or DARE, TIES-Merging with DARE, TIES-Merging with M3, and TIES-Merging with both M3 and DARE.

Task-Specific Fine-Tuned LLMs and Datasets

  • We conduct model merging experiments on three task-specific fine-tuned LLMs: WizardLM-13B (LM), WizardMath-13B (Math), and llama-2-13b-code-alpaca (Code), all based on Llama-2-13B as the pre-trained backbone.
  • We evaluate performance across three tasks using five datasets: AlpacaEval (instruction-following), GSM8K and MATH (mathematical reasoning), and HumanEval and MBPP (code generation).

Model Merging Methods

We follow the model merging implementations in the Super Mario repository, using Average Merging, Task Arithmetic, and TIES-Merging in our experiments. Additionally, we combine the proposed Mixup Model Merge (M3) with these methods to enhance merging performance.
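As an illustration of how M3 plugs into one of these methods, the sketch below applies a randomly drawn ratio to the two task vectors used by Task Arithmetic. The Beta(alpha, alpha) sampling and the exact way the ratio reweights the task vectors are simplifying assumptions for illustration, not the repository code.

import torch

def task_arithmetic_with_m3(theta_pre, theta_a, theta_b, scaling_coefficient=1.0, alpha=0.5):
    # theta_pre: pre-trained backbone (e.g., Llama-2-13B); theta_a / theta_b: fine-tuned models.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    merged = {}
    for name, pre in theta_pre.items():
        tau_a = theta_a[name] - pre  # task vector of model A
        tau_b = theta_b[name] - pre  # task vector of model B
        # M3 replaces the fixed, equal combination of task vectors with a random lam : (1 - lam) mix.
        merged[name] = pre + scaling_coefficient * (lam * tau_a + (1.0 - lam) * tau_b)
    return merged, lam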

Model Robustness Evaluation

Out-of-distribution robustness

To assess OOD robustness, we use the instruction-following (LiveBench-Instruction), coding (LiveBench-Coding), and language comprehension (LiveBench-TypoFixing) categories in LiveBench. We select recent, diverse datasets to ensure they represent OOD data, with Math & Code evaluated on LiveBench-Instruction, LM & Math on LiveBench-Coding, and LM & Code on LiveBench-TypoFixing.

❗Note 1: "Math & Code," "LM & Math," and "LM & Code" represent the merged models.

❗Note 2: For OOD robustness of LM & Code, only the typo-fixing task in LiveBench-TypoFixing is considered, abbreviated as LiveBench-TypoFixing.

Adversarial robustness

We utilize the Adversarial Prompt Attacks module in PromptBench to assess the robustness of LLMs against adversarial prompts. Specifically, we employ three attack methods: DeepWordBug (character-level), BERTAttack (word-level), and StressTest (sentence-level). The evaluation is conducted on two datasets supported by PromptBench: SST2 (sentiment analysis) and CoLA (grammatical correctness).

Environment Installation

PyTorch 2.0.1, vllm 0.1.4, transformers 4.33.1, datasets 2.13.1, numpy 1.26.3, fsspec 2023.9.2, human-eval, tqdm, jsonlines, fraction, and accelerate.
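One possible way to set up the environment with pip is shown below (a sketch, not an official requirements file; package names follow the list above, and human-eval is typically installed from its GitHub repository rather than via pip, so adjust as needed):

pip install torch==2.0.1 vllm==0.1.4 transformers==4.33.1 datasets==2.13.1 numpy==1.26.3 fsspec==2023.9.2 tqdm jsonlines fraction accelerate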

Scripts for Merging Models

  • Example of merging WizardLM-13B-V1.2 and WizardMath-13B-V1.0 using Average Merging, Task Arithmetic (scaling_coefficient 1.0) or TIES-Merging (scaling_coefficient 1.0, param_value_mask_rate 0.5):
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name average_merging --tensor_parallel_size 2
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name task_arithmetic --scaling_coefficient 1.0 --tensor_parallel_size 2
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name ties-merging --scaling_coefficient 1.0 --param_value_mask_rate 0.5 --tensor_parallel_size 2
  • Example of merging WizardLM-13B-V1.2 and WizardMath-13B-V1.0 using Average Merging with M3 (alpha 0.5):
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name average_merging --tensor_parallel_size 2 --mixup --alpha 0.5
  • Example of merging WizardLM-13B-V1.2 and WizardMath-13B-V1.0 using Average Merging with DARE (drop rate 0.2):
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 2
  • Example of merging WizardLM-13B-V1.2 and WizardMath-13B-V1.0 using Average Merging with M3 and DARE (drop rate 0.2, alpha 0.5):
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 2 --mixup --alpha 0.5

❗Note 1: When merging LLMs, the number of GPUs required = num_models_to_merge * tensor_parallel_size.

Evaluation Process for AlpacaEval, HumanEval and MBPP

Following the Super Mario repository, the generation scripts store the inference result files. Run the following evaluation commands on these files to obtain the final metrics.

  • For AlpacaEval: First, please visit the alpaca_eval repository to install the required environment. Then, to evaluate the generated average_merging.json file, which contains the inference results for the merged LM & Code model obtained by applying Average Merging to WizardLM-13B-V1.2 and llama-2-13b-code-alpaca, please run the following command:
alpaca_eval --model_outputs ./save_gen_instruct_responses_results/instruct_code/alpaca_eval/average_merging.json --annotators_config chatgpt_fn --name average_merging

❗Note 1: --annotators_config chatgpt_fn indicates the use of chatgpt_fn from the alpaca_eval repository to compute the win rate.

❗Note 2: The alpaca_eval repository has been upgraded to AlpacaEval 2.0, and we now default to using AlpacaEval 2.0 for evaluation.

  • For HumanEval: First, install the environment from the human-eval repository. Then, to evaluate the generated average_merging.jsonl file, run:
evaluate_functional_correctness ./save_gen_codes_results/instruct_code/human_eval/average_merging.jsonl
  • For MBPP: To evaluate the generated average_merging.jsonl file with the bigcode-evaluation-harness, run:
accelerate launch ./bigcode-evaluation-harness/main.py --tasks mbpp --allow_code_execution --load_generations_path ./save_gen_codes_results/instruct_code/mbpp/average_merging.jsonl

Scripts for Robustness Evaluation

OOD Robustness

Math & Code is evaluated on LiveBench-Instruction, LM & Math on LiveBench-Coding, and LM & Code on LiveBench-TypoFixing. Download the corresponding data from the LiveBench HF repository.

First, run our inference code to generate the inference result files for evaluation on LiveBench-Instruction and LiveBench-Coding. Then, use the code from the LiveBench repository to obtain the final metrics.

  1. For example, to run inference for Math & Code on LiveBench-Instruction, run:
python ood_robust_inference.py --load_model_path /path/to/your/merged/model --tensor_parallel_size 4 --evaluate_task instruction_following

❗Note 1: The folder path for storing the inference results is: ./save_gen_instruction_following_livebench_results/instruction_following.

  2. Next, set up the environment according to the LiveBench repository by running the following commands:
git clone https://github.com/LiveBench/LiveBench.git
cd LiveBench
conda create -n livebench python=3.10
conda activate livebench
cd LiveBench
pip install -e .

Then, place the generated inference result folders, instruction_following and coding, in the LiveBench/data/live_bench directory.

To obtain the final metrics on LiveBench-Instruction, run:

python gen_ground_truth_judgment.py --bench-name live_bench/instruction_following
python show_livebench_result.py --bench-name live_bench/instruction_following

To obtain the final metrics on LiveBench-Coding, run:

python gen_ground_truth_judgment.py --bench-name live_bench/coding
python show_livebench_result.py --bench-name live_bench/coding

For evaluation on LiveBench-TypoFixing, simply run the following command to obtain the final metrics:

python ood_robust_inference.py --load_model_path /path/to/your/merged/model --tensor_parallel_size 4 --evaluate_task typos

Adversarial robustness

  1. Prepare the adversarial robustness evaluation environment:
cd promptbench
conda create --name promptbench python=3.10
conda activate promptbench
pip install -r requirements_adv.txt
  2. Take StressTest as the prompt attack method and evaluate the adversarial robustness of the merged model on the SST2 dataset:
cd promptbench
python sst2.py --model_path /path/to/your/merged/model --attack_method stresstest

Take StressTest as the prompt attack method and evaluate the adversarial robustness of the merged model on the CoLA dataset:

python cola.py --model_path /path/to/your/merged/model --attack_method stresstest

❗Note 1: To ensure proper package references, run the above Adversarial robustness evaluation script within the promptbench folder.

❗Note 2: The script will automatically download the corresponding dataset from HF during execution.

Acknowledgments

We sincerely thank the authors of the Super Mario repository for making their project code publicly available.

Citation

If you find this project helpful, please consider citing our paper.

@article{zhou2025mixup,
  title={Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation},
  author={Zhou, Yue and Chang, Yi and Wu, Yuan},
  journal={arXiv preprint arXiv:2502.15434},
  year={2025}
}
