1Shanghai Jiao Tong University 2Shanghai AI Laboratory 3East China Normal University
*Equal contribution. †Corresponding author.
🥳 Accepted to the ACL 2025 main conference
Existing merging methods can produce safety-utility conflicts: a merged model that excels at, say, mathematical reasoning may also generate harmful content, as illustrated by the example conversation with an unsafe mathematical AI expert. These conflicts stem from two problems: neuron misidentification, where simple metrics such as parameter magnitude fail to distinguish safety-related regions, and neuron interference, where neurons optimized for different tasks (e.g., safety and code generation) cause antagonistic updates during merging.
Our LED-Merging framework addresses neuron misidentification and interference by decomposing the merging process into three key steps: Location, Election, and Disjoint Merging. The overall workflow is depicted in the figure below.
Location: We begin by calculating importance scores for each neuron in both the base and fine-tuned models. Given a location dataset $\mathcal{X}_i = \{(x, y)_k\}$, where $x$ is the question and $y$ is the answer, we compute the importance score for weight $\theta_i \in \mathbb{R}^D$ in any layer using the SNIP score, defined as:

$$I(\theta_i) = \left| \theta_i \odot \nabla_{\theta_i} \mathcal{L}(x) \right|,$$

where $\mathcal{L}(x) = -\log p(y \mid x)$ is the conditional negative log-likelihood loss and $\odot$ denotes the element-wise product. We select the top-$r_i$ neurons as the important neuron subset $\mathcal{N}_i^{r_i}$.
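As a minimal sketch of the Location step, assuming the per-weight gradient of the NLL loss has already been computed (e.g., by a backward pass), the SNIP score and top-$r$ selection can be written as:

```python
import numpy as np

def snip_score(theta, grad):
    """SNIP importance of each weight: |theta * dL/dtheta|, elementwise."""
    return np.abs(theta * grad)

def top_r_set(score, r):
    """Index set of the top-r fraction of weights by importance score."""
    k = max(1, int(r * score.size))
    return set(np.argsort(score.ravel())[-k:].tolist())
```

In practice the score would be accumulated over the whole location dataset $\mathcal{X}_i$; the function names here are illustrative, not from the released code.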
Election: To accurately select important neurons in the task vector $\tau_i = \theta_i - \theta_{\mathrm{base}}$, we consider importance scores from both the base model, $I(\theta_{\mathrm{base}})$, and the fine-tuned model, $I(\theta_i)$. Our election strategy keeps only neurons that score highly in both:

$$\mathcal{N}(\tau_i) = \mathcal{N}_{\mathrm{base}}^{\,r_{\mathrm{base}}} \cap \mathcal{N}_i^{\,r_i}.$$

This intersection is more precise than relying on a single magnitude score.
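The Election step reduces to a set intersection of the two top-$r$ selections. A small sketch (function and argument names are illustrative):

```python
import numpy as np

def elect(score_base, score_ft, r_base, r_ft):
    """Elect neurons that rank in the top-r fraction under BOTH the
    base-model and fine-tuned-model importance scores."""
    def top_r(score, r):
        k = max(1, int(r * score.size))
        return set(np.argsort(score.ravel())[-k:].tolist())
    return top_r(score_base, r_base) & top_r(score_ft, r_ft)
```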
Disjoint Merging: To prevent interference between important neurons from different task vectors, we use a set difference operation to isolate them:

$$\mathrm{Disjoint}(\mathcal{T}_i) = \mathcal{N}(\tau_i) \setminus \bigcup_{j \neq i} \mathcal{N}(\tau_j).$$

This ensures that $\mathrm{Disjoint}(\mathcal{T}_i)$ contains only neurons uniquely attributed to task $i$. We then construct a mask $m_i \in \mathbb{R}^D$ to select these disjoint neurons from $\tau_i$ during merging:

$$m_i[d] = \begin{cases} 1, & d \in \mathrm{Disjoint}(\mathcal{T}_i) \\ 0, & \text{otherwise.} \end{cases}$$

Merging: The final merged task vector $\tau_m$ is then computed as a scaled sum of the masked task vectors:

$$\tau_m = \sum_i \lambda_i \,(m_i \odot \tau_i), \qquad \theta_m = \theta_{\mathrm{base}} + \tau_m.$$
The complete workflow is summarized in Algorithm 1, which outlines the steps for calculating importance scores, electing critical neuron sets, disjointing overlapping neurons, and applying these masks to the task vectors for final model merging.
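Putting the last two steps together, a compact sketch of disjoint masking and merging on flat parameter vectors (all names hypothetical; real models would do this per layer):

```python
import numpy as np

def disjoint_merge(theta_base, task_vectors, elected_sets, lambdas):
    """Mask each task vector to its uniquely elected neurons, then merge.

    task_vectors : list of tau_i = theta_i - theta_base (flat arrays)
    elected_sets : list of elected index sets from the Election step
    lambdas      : per-task scaling factors
    """
    merged = theta_base.astype(float).copy()
    for i, (tau, elected) in enumerate(zip(task_vectors, elected_sets)):
        others = set().union(*(s for j, s in enumerate(elected_sets) if j != i))
        keep = elected - others                 # Disjoint(T_i): unique to task i
        mask = np.zeros_like(tau)
        mask[list(keep)] = 1.0                  # m_i selects only disjoint neurons
        merged += lambdas[i] * mask * tau       # accumulate lambda_i * (m_i ⊙ tau_i)
    return merged
```

Neurons elected by more than one task are dropped from every task vector, which is what removes the antagonistic updates described above.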
We compare LED-Merging with Model Stock, Breadcrumbs, Task Arithmetic, and Ties-Merging across safety (HarmBench, SORRY-Bench), math reasoning (GSM8K, MATH), and code generation (MBPP, HumanEvalPack), reporting ASR, accuracy, and Pass@1.
Experiments are conducted on Llama-3-8B, Mistral-7B, and WizardLM/Llama2-13B series with safety- and utility-specialized checkpoints. We tune mask ratios $r_i$ and scaling factors $\lambda_i$ by grid search; Figure 5 shows that moderate masks ($0.3$-$0.5$) and balanced scaling provide the best safety-utility trade-off.
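The grid search over mask ratios and scaling factors is straightforward; a sketch with a hypothetical `evaluate` callback that returns (ASR, utility) for one configuration and a simple combined objective (not the paper's exact selection criterion):

```python
import itertools

def grid_search(evaluate, ratios=(0.3, 0.4, 0.5), scales=(0.5, 1.0, 1.5)):
    """Return the (r, lambda) pair with the best safety-utility trade-off,
    scored here as ASR minus utility (lower is better)."""
    best, best_cfg = float("inf"), None
    for r, lam in itertools.product(ratios, scales):
        asr, util = evaluate(r, lam)
        obj = asr - util
        if obj < best:
            best, best_cfg = obj, (r, lam)
    return best_cfg
```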
LED-Merging consistently achieves superior safety performance while preserving utility across various models and tasks. For Llama3-8B, merging safety-aligned and code-specialized models reduced ASR to 14.75% on HarmBench, representing a 75.9% improvement over the standalone code model and a 31.4% enhancement compared to the original LM model. Similarly, on Mistral-7B, merging safety and math models achieved an ASR of 16%, significantly outperforming Task Arithmetic (ASR = 55.75%) and Ties-Merging (ASR = 62%). For larger models like Llama2-13B, multi-task merging maintained an exceptionally low ASR of 4%.
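The relative improvements quoted above follow directly from the HarmBench ASR numbers; a quick arithmetic check:

```python
# HarmBench ASR (%) for Llama3-8B: standalone code model, standalone LM,
# and the LED-merged safety+code model (figures from the text above).
code_only, lm_only, merged = 61.25, 21.50, 14.75

impr_vs_code = (code_only - merged) / code_only  # relative ASR reduction vs. code model
impr_vs_lm = (lm_only - merged) / lm_only        # relative ASR reduction vs. LM model

print(f"{impr_vs_code:.1%}, {impr_vs_lm:.1%}")   # prints "75.9%, 31.4%"
```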
| Merging Methods | LM | Math | Code | HarmBench↓ | SORRY-Bench↓ | GSM8K↑ | MATH↑ | MBPP↑ | HumanEvalPack↑ |
|---|---|---|---|---|---|---|---|---|---|
| w/o Merging | ✓ | | | 21.50 | 18.67 | 81.05 | 24.56 | 1.00 | 3.65 |
| | | ✓ | | 42.00 | 50.60 | 79.00 | 36.72 | / | / |
| | | | ✓ | 61.25 | 90.40 | / | / | 33.60 | 42.68 |
| Model Stock | ✓ | ✓ | | 36.00 | 39.55 | 59.67 | 16.64 | / | / |
| | ✓ | | ✓ | 17.25 | 12.67 | / | / | 47.00 | 39.02 |
| | ✓ | ✓ | ✓ | 23.25 | 17.78 | 52.92 | 15.22 | 47.80 | 36.59 |
| Breadcrumbs | ✓ | ✓ | | 33.00 | 35.78 | * | * | / | / |
| | ✓ | | ✓ | 39.50 | 36.89 | / | / | 53.40 | 36.58 |
| | ✓ | ✓ | ✓ | 38.25 | 40.44 | * | * | 49.40 | 36.59 |
| Task Arithmetic | ✓ | ✓ | | 26.50 | 28.89 | 54.59 | 16.77 | / | / |
| | ✓ | | ✓ | 38.00 | 31.11 | / | / | 37.80 | 18.90 |
| | ✓ | ✓ | ✓ | 32.00 | 38.44 | 13.12 | 9.92 | 21.80 | 9.15 |
| Ties-Merging | ✓ | ✓ | | 35.75 | 37.11 | 55.37 | 17.45 | / | / |
| | ✓ | | ✓ | 45.00 | 46.44 | / | / | 41.60 | 33.53 |
| | ✓ | ✓ | ✓ | 41.25 | 46.44 | 53.01 | 16.72 | 50.20 | 30.34 |
| LED-Merging (Ours) | ✓ | ✓ | | 21.00 | 11.33 | 49.89 | 16.12 | / | / |
| | ✓ | | ✓ | 14.75 | 10.22 | / | / | 47.20 | 37.80 |
| | ✓ | ✓ | ✓ | 20.75 | 10.44 | 52.39 | 15.08 | 44.60 | 36.59 |

HarmBench and SORRY-Bench measure safety alignment (ASR), GSM8K and MATH mathematical reasoning, and MBPP and HumanEvalPack code generation. "/" marks benchmarks not applicable to the given merge; "*" marks runs where the merged model collapsed.
Beyond safety, LED-Merging preserves strong utility: on Llama3-8B, safety+math reaches 52.39% GSM8K, safety+code reaches 47.2% MBPP Pass@1, and multi-task merging keeps balanced performance with substantially lower ASR than Task Arithmetic.
To keep this page concise, we present one representative main table above. Additional cross-family (Mistral/WizardLM) and multilingual comparisons are omitted here and can be provided in the paper/appendix version.
Our analysis reveals significant overlap between safety- and utility-related neurons, particularly in attention layers, suggesting a heightened risk of conflict during model merging. We calculated the Jaccard index between the top 20% safety and utility neurons across Llama3-8B-Series models, finding high values in most transformer layers. This highlights why our disjoint merging strategy is crucial.
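The overlap measurement above is just the Jaccard index between two top-fraction neuron sets; a minimal sketch (function names are illustrative):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two neuron index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def top_fraction(score, frac=0.2):
    """Indices of the top `frac` of neurons by importance score."""
    k = max(1, int(frac * score.size))
    return np.argsort(score.ravel())[-k:].tolist()
```

Applied per layer to the safety and utility importance scores, this yields the layer-wise overlap profile discussed above.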
Ablation on Mistral-7B verifies that all three components matter: SNIP-based location, dual-score election (11), and disjoint merging jointly deliver the best safety-utility balance.
| Ablation Part | Alternative Methods | HarmBench↓ | SORRY-Bench↓ | GSM8K↑ | MATH↑ |
|---|---|---|---|---|---|
| Location | Random | * | * | 25.58 | 8.66 |
| | Wanda | * | * | 39.58 | 11.37 |
| | SNIP | 16.00 | 24.22 | 50.34 | 14.20 |
| Election | 01 | 58.00 | 83.77 | 54.13 | 13.12 |
| | 10 | 35.25 | 47.33 | 50.64 | 13.30 |
| | 11 | 16.00 | 24.22 | 50.34 | 14.20 |
| Disjoint | ✗ | 63.00 | 85.33 | 72.93 | 23.18 |
| | ✓ | 16.00 | 24.22 | 50.34 | 14.20 |
Compared with Random/Wanda, SNIP avoids instruction collapse; compared with 01/10 election, 11 is more balanced; and removing disjoint merging causes severe safety regression (HarmBench 63.00), confirming that disjoint isolation is critical.
@inproceedings{ma2025led,
title={{LED-Merging}: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint},
author={Ma, Qianli and Liu, Dongrui and Chen, Qian and Zhang, Linfeng and Shao, Jing},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={21749--21767},
year={2025}
}