LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint

Qianli Ma1*, Dongrui Liu2*, Qian Chen2,3, Linfeng Zhang1, Jing Shao2

1Shanghai Jiao Tong University 2Shanghai AI Laboratory 3East China Normal University

*Equal contribution. Corresponding author.

🥳 Accepted by ACL2025 main conference

Overview

Existing merging methods can produce safety-utility conflicts: a merged model that excels at, say, mathematical reasoning may also readily generate harmful content. These conflicts stem from two problems: neuron misidentification, where simple metrics such as parameter magnitude fail to pinpoint safety-related regions, and cross-task neuron interference, where neurons optimized for different tasks (e.g., safety and code generation) receive antagonistic updates during merging.

Overview Figure
(a): Merging different models suffers from safety-utility conflicts: a merged LLM may be good at math while tending to output harmful sentences. (b): Comparison between utilities (math, code) and safety, reporting accuracy on GSM8K and Pass@1 on HumanEvalPack against safety scores on SORRY-Bench. Methods in the purple box are single-score methods; the green box marks LED-Merging. (c): Top: the cross-task interference issue, where safety- and code-related neurons may collide during updates. Bottom: LED-Merging disjoints task-specific neurons to avoid conflicts.

Methodology

Our LED-Merging framework addresses neuron misidentification and interference by decomposing the merging process into three key steps: Location, Election, and Disjoint Merging. The overall workflow is depicted in the figure below.

Methodology Overview
Overview of LED-Merging. In the location step, we identify important safety and utility neurons in the base and fine-tuned models, respectively (different colors denote different neuron types). In the election step, we select neurons that score highly in both models as the safety- and utility-related neurons of each task vector. We then disjoint these important neurons and construct masks that isolate neurons duplicated across task vectors. Finally, we combine the masked task vectors into a single merged task vector.

Location: We begin by computing importance scores for each neuron in both the base and fine-tuned models. Given a location dataset $\mathcal{X}_i = \{(x, y)_k\}$, where $x$ is the question and $y$ is the answer, we compute the importance score of a weight $\theta_i \in \mathbb{R}^D$ in any layer using the SNIP score, defined as:

$$I(\theta_i) = \mathbb{E}_{x \sim \mathcal{X}_i}\left[\theta_i \odot \nabla_{\theta_i} \mathcal{L}(x)\right],$$

where $\mathcal{L}(x) = -\log p(y \mid x)$ is the conditional negative log-likelihood loss. We select the top-$r_i$ fraction of neurons as the important neuron subset $\mathcal{N}_i^{r_i}$.
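As a concrete sketch, the SNIP scoring and top-$r_i$ selection can be approximated in PyTorch as below. The model, data loader, and `loss_fn` (computing the conditional NLL $-\log p(y \mid x)$) are placeholders of our own, not the paper's actual implementation:

```python
import torch

def snip_importance(model, data_loader, loss_fn):
    """Accumulate SNIP scores I(theta) = E[theta * dL/dtheta] per parameter."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for batch in data_loader:
        model.zero_grad()
        loss = loss_fn(model, batch)   # conditional NLL  -log p(y | x)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.detach() * p.grad.detach()
        n_batches += 1
    return {n: s / n_batches for n, s in scores.items()}

def top_r_mask(score, r):
    """Boolean mask selecting the top-r fraction of entries by |score|."""
    flat = score.abs().flatten()
    k = max(1, int(r * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return score.abs() >= threshold
```

The element-wise scoring above is the simplest variant; an implementation could also aggregate scores per neuron (e.g., per row of a weight matrix) before thresholding.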

Election: To accurately select important neurons in the task vector $\tau_i$, we consider importance scores from both the base model, $I(\theta_{\mathrm{base}})$, and the fine-tuned model, $I(\theta_i)$. Our election strategy selects neurons with high scores in both models:

$$\mathcal{T}_i^{r_i} = \mathcal{N}_i^{r_i} \cap \mathcal{N}_{\mathrm{base}}^{r_i}.$$

This approach is more precise than relying on a single magnitude score.
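A minimal sketch of the election step, assuming `score_ft` and `score_base` are importance tensors for the same weight from the fine-tuned and base models (function names are ours):

```python
import torch

def elect(score_ft, score_base, r):
    """Keep coordinates that rank in the top-r fraction of BOTH models."""
    def top_mask(score):
        flat = score.abs().flatten()
        k = max(1, int(r * flat.numel()))
        thr = torch.topk(flat, k).values.min()
        return score.abs() >= thr
    # Intersection of the two top-r sets, as boolean masks.
    return top_mask(score_ft) & top_mask(score_base)
```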

Disjoint Merging: To prevent interference between important neurons from different task vectors, we use a set-difference operation to isolate them:

$$\mathrm{Disjoint}(\mathcal{T}_i^{r_i}) = \mathcal{T}_i^{r_i} \setminus \bigcup_{\substack{J \subseteq [K] \\ |J| \geq 2}} \; \bigcap_{j \in J} \mathcal{T}_j^{r_j}.$$

This ensures that $\mathrm{Disjoint}(\mathcal{T}_i^{r_i})$ contains only neurons uniquely attributed to task $i$. We then construct a mask $m_i \in \mathbb{R}^D$ to select these disjoint neurons from $\tau_i$ during merging:

$$m_{i,d} = \begin{cases} 1, & \text{if } d \in \mathrm{Disjoint}(\mathcal{T}_i^{r_i}), \\ 0, & \text{otherwise}. \end{cases}$$
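The disjoint-and-mask construction can be sketched as follows, where `elected` holds one boolean election mask per task vector (a hypothetical layout, not the paper's code). A coordinate survives only if exactly one task vector claims it:

```python
import torch

def disjoint_masks(elected):
    """Zero out coordinates elected by two or more task vectors.

    elected: list of boolean tensors (one per task vector), all same shape.
    Returns one 0/1 float mask per task with overlapping coordinates removed.
    """
    counts = torch.stack(elected).int().sum(dim=0)  # how many tasks claim each coordinate
    unique = counts == 1                            # kept only if exactly one task claims it
    return [(e & unique).float() for e in elected]
```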

Merging: The final merged task vector $\tau_m$ is then computed as:

$$\tau_m = \sum_i \lambda_i \, \tau_i \odot m_i.$$
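Putting the pieces together, the final merge can be sketched as below; `task_vectors` are the differences $\tau_i = \theta_i - \theta_{\mathrm{base}}$ and `masks` are the 0/1 disjoint masks (all names are illustrative):

```python
import torch

def merge(theta_base, task_vectors, masks, lambdas):
    """theta_merged = theta_base + sum_i lambda_i * (tau_i ⊙ m_i)."""
    tau_m = sum(lam * tau * m for lam, tau, m in zip(lambdas, task_vectors, masks))
    return theta_base + tau_m
```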

The complete workflow is summarized in Algorithm 1, which outlines the steps for calculating importance scores, electing critical neuron sets, disjointing overlapping neurons, and applying these masks to the task vectors for final model merging.

Experiments

We compare LED-Merging with Model Stock, Breadcrumbs, Task Arithmetic, and Ties-Merging across safety (HarmBench, SORRY-Bench), math reasoning (GSM8K, MATH), and code generation (MBPP, HumanEvalPack), reporting ASR, accuracy, and Pass@1.

Experiments are conducted on Llama-3-8B, Mistral-7B, and WizardLM/Llama2-13B series with safety- and utility-specialized checkpoints. We tune mask ratios $r_i$ and scaling factors $\lambda_i$ by grid search; Figure 5 shows that moderate masks ($0.3$-$0.5$) and balanced scaling provide the best safety-utility trade-off.

Hyperparameter Analysis
Safety-utility trade-offs under varying hyperparameters. (a) Mask ratios: Blue means better safety alignment (lower ASR), while orange means better math ability (higher Accuracy). The Pareto frontier (white dashed line) reveals optimal ratios (0.3–0.5) balancing both metrics. (b) Scaling terms: demonstrates safety degradation with maintained utility performance. Star markers denote configurations achieving > 90% safety preservation with < 5% utility loss.
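A simple version of this grid search might look like the following sketch; the `evaluate` callback (returning an ASR and a utility score for a given mask ratio and scaling factor) and the scalarized trade-off are our own simplifications, not the paper's tuning code:

```python
import itertools

def grid_search(evaluate, ratios=(0.3, 0.4, 0.5), lambdas=(0.5, 0.8, 1.0)):
    """Pick the (ratio, lambda) pair minimizing ASR minus utility.

    evaluate(r, lam) -> (asr, utility); lower ASR and higher
    utility both reduce the scalarized score.
    """
    best, best_cfg = None, None
    for r, lam in itertools.product(ratios, lambdas):
        asr, util = evaluate(r, lam)
        score = asr - util
        if best is None or score < best:
            best, best_cfg = score, (r, lam)
    return best_cfg
```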

Main Results

LED-Merging consistently achieves superior safety performance while preserving utility across models and tasks. For Llama3-8B, merging safety-aligned and code-specialized models reduced ASR to 14.75% on HarmBench, a 75.9% relative reduction over the standalone code model and a 31.4% relative reduction compared to the original LM model. Similarly, on Mistral-7B, merging safety and math models achieved an ASR of 16%, significantly outperforming Task Arithmetic (ASR = 55.75%) and Ties-Merging (ASR = 62%). For larger models like Llama2-13B, multi-task merging maintained an exceptionally low ASR of 4%.

| Merging Methods | LM | Math | Code | HarmBench↓ | SORRY-Bench↓ | GSM8K↑ | MATH↑ | MBPP↑ | HumanEvalPack↑ |
|---|---|---|---|---|---|---|---|---|---|
| w/o Merging | ✓ | | | 21.50 | 18.67 | 81.05 | 24.56 | 1.00 | 3.65 |
| | | ✓ | | 42.00 | 50.60 | 79.00 | 36.72 | / | / |
| | | | ✓ | 61.25 | 90.40 | / | / | 33.60 | 42.68 |
| Model Stock | ✓ | ✓ | | 36.00 | 39.55 | 59.67 | 16.64 | / | / |
| | ✓ | | ✓ | 17.25 | 12.67 | / | / | 47.00 | 39.02 |
| | ✓ | ✓ | ✓ | 23.25 | 17.78 | 52.92 | 15.22 | 47.80 | 36.59 |
| Breadcrumbs | ✓ | ✓ | | 33.00 | 35.78 | * | * | / | / |
| | ✓ | | ✓ | 39.50 | 36.89 | / | / | 53.40 | 36.58 |
| | ✓ | ✓ | ✓ | 38.25 | 40.44 | * | * | 49.40 | 36.59 |
| Task Arithmetic | ✓ | ✓ | | 26.50 | 28.89 | 54.59 | 16.77 | / | / |
| | ✓ | | ✓ | 38.00 | 31.11 | / | / | 37.80 | 18.90 |
| | ✓ | ✓ | ✓ | 32.00 | 38.44 | 13.12 | 9.92 | 21.80 | 9.15 |
| Ties-Merging | ✓ | ✓ | | 35.75 | 37.11 | 55.37 | 17.45 | / | / |
| | ✓ | | ✓ | 45.00 | 46.44 | / | / | 41.60 | 33.53 |
| | ✓ | ✓ | ✓ | 41.25 | 46.44 | 53.01 | 16.72 | 50.20 | 30.34 |
| LED-Merging (Ours) | ✓ | ✓ | | 21.00 | 11.33 | 49.89 | 16.12 | / | / |
| | ✓ | | ✓ | 14.75 | 10.22 | / | / | 47.20 | 37.80 |
| | ✓ | ✓ | ✓ | 20.75 | 10.44 | 52.39 | 15.08 | 44.60 | 36.59 |
Performance of merging Llama3-8B-Instruct (LM), MAmmoTH2-8B-Plus (Math), and Replete-Coder-Llama3-8B (Code) on all the datasets. HarmBench and SORRY-Bench measure safety alignment; GSM8K and MATH measure mathematical reasoning; MBPP and HumanEvalPack measure code generation. The best and second-best results are marked in bold and underlined fonts in the paper. *: the merged model fails to produce structured responses. /: not applicable.

Beyond safety, LED-Merging preserves strong utility: on Llama3-8B, safety+math reaches 52.39% GSM8K, safety+code reaches 47.2% MBPP Pass@1, and multi-task merging keeps balanced performance with substantially lower ASR than Task Arithmetic.

To keep this page concise, we present one representative main table above. Additional cross-family (Mistral/WizardLM) and multilingual comparisons are deferred to the paper and its appendix.

Our analysis reveals significant overlap between safety- and utility-related neurons, particularly in attention layers, suggesting a heightened risk of conflict during model merging. We calculated the Jaccard index between the top 20% safety and utility neurons across Llama3-8B-Series models, finding high values in most transformer layers. This highlights why our disjoint merging strategy is crucial.

Neuron Level Analysis
Neuron-level analysis of safety-utility overlap in each layer of Llama3-8B. Following Wei et al. (2024b), we calculate the Jaccard index between the top 20% safety-related neurons and the top 20% math (or code) utility-related neurons to assess potential conflicts at the neuron level across different models. A higher Jaccard index signifies greater overlap between safety and utility neurons. Notably, the significant overlap between safety- and utility-related neurons, particularly in the attention layers, suggests an elevated risk of conflict during model merging.
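The per-layer Jaccard index used in this analysis can be computed as in the following sketch, assuming two importance-score tensors for the same layer (names are ours):

```python
import torch

def jaccard_top_neurons(score_a, score_b, frac=0.2):
    """Jaccard index between the top-`frac` neuron sets of two importance scores."""
    def top_set(score):
        flat = score.abs().flatten()
        k = max(1, int(frac * flat.numel()))
        return set(torch.topk(flat, k).indices.tolist())
    a, b = top_set(score_a), top_set(score_b)
    return len(a & b) / len(a | b)
```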

Ablation Study

Ablation on Mistral-7B verifies that all three components matter: SNIP-based location, dual-score election (the "11" variant, which uses importance scores from both the base and fine-tuned models), and disjoint merging jointly deliver the best safety-utility balance.

| Ablation Part | Alternative Methods | HarmBench↓ | SORRY-Bench↓ | GSM8K↑ | MATH↑ |
|---|---|---|---|---|---|
| Location | Random | * | * | 25.58 | 8.66 |
| | Wanda | * | * | 39.58 | 11.37 |
| | SNIP | 16.00 | 24.22 | 50.34 | 14.20 |
| Election | 01 | 58.00 | 83.77 | 54.13 | 13.12 |
| | 10 | 35.25 | 47.33 | 50.64 | 13.30 |
| | 11 | 16.00 | 24.22 | 50.34 | 14.20 |
| Disjoint | ✗ | 63.00 | 85.33 | 72.93 | 23.18 |
| | ✓ | 16.00 | 24.22 | 50.34 | 14.20 |
Ablation study. Experiments are conducted on Mistral-7B-series models. *: the LLM's instruction-following ability is impaired.

Compared with Random and Wanda location, SNIP avoids instruction-following collapse; compared with single-score ("01"/"10") election, the dual-score ("11") variant is better balanced; and removing disjoint merging causes severe safety regression (HarmBench ASR 63.00), confirming that disjoint isolation is critical.

BibTeX

@inproceedings{ma2025led,
  title={{LED}-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint},
  author={Ma, Qianli and Liu, Dongrui and Chen, Qian and Zhang, Linfeng and Shao, Jing},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={21749--21767},
  year={2025}
}