LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint

Qianli Ma1*, Dongrui Liu2*, Qian Chen2,3, Linfeng Zhang1, Jing Shao2

1Shanghai Jiao Tong University 2Shanghai AI Laboratory 3East China Normal University

*Equal contribution. Corresponding author.

🥳 Accepted by ACL2025 main conference

Overview

Existing merging methods can produce safety-utility conflicts: a merged model that excels at, say, mathematical reasoning may also readily generate harmful content. These conflicts stem from two problems: neuron misidentification, where simple metrics such as parameter magnitude fail to pinpoint safety-related regions, and cross-task neuron interference, where neurons optimized for different tasks (e.g., safety and code generation) receive antagonistic updates during merging.

Overview Figure
(a): Merging different models suffers from safety-utility conflicts: a merged LLM may be good at math while tending to output harmful sentences. (b): Comparison between utilities (math, code) and safety, reporting accuracy on GSM8K and Pass@1 on HumanEvalPack against safety scores on SORRY-Bench. Methods in the purple box are single-score methods; the green box marks LED-Merging. (c): Top: the cross-task interference issue, where safety- and code-related neurons may collide during updates. Bottom: LED-Merging disjoints task-specific neurons to avoid conflicts.

Methodology

Our LED-Merging framework addresses neuron misidentification and interference by decomposing the merging process into three key steps: Location, Election, and Disjoint Merging. The overall workflow is depicted in the figure below.

Methodology Overview
Overview of LED-Merging. In the location step, we identify important safety and utility neurons in the base and fine-tuned models, respectively (different colors denote different neuron types). In the election step, we select neurons that score highly in both models as the safety- and utility-related neurons of each task vector. We then disjoint these important neurons and construct masks that isolate neurons duplicated across task vectors. Finally, we combine the masked task vectors into a single merged task vector.

Location: We begin by computing importance scores for each neuron in both the base and fine-tuned models. Given a location dataset $\mathcal{X}_i = \{(x, y)_k\}$, where $x$ is the question and $y$ is the answer, we compute the importance score of a weight $\theta_i \in \mathbb{R}^D$ in any layer using the SNIP score, defined as:

$$I(\theta_i) = \mathbb{E}_{x \sim \mathcal{X}_i}\left[\theta_i \odot \nabla_{\theta_i} \mathcal{L}(x)\right],$$

where $\mathcal{L}(x) = -\log p(y \mid x)$ is the conditional negative log-likelihood loss. We select the top-$r_i$ fraction of neurons as the important neuron subset $\mathcal{N}_i^{r_i}$.
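As a concrete sketch, the SNIP scoring and top-$r_i$ selection can be approximated in PyTorch as below. The model, data loader, and `loss_fn` (computing the conditional NLL $-\log p(y \mid x)$) are placeholders of our own, not the paper's actual implementation:

```python
import torch

def snip_importance(model, data_loader, loss_fn):
    """Accumulate SNIP scores I(theta) = E[theta * dL/dtheta] per parameter."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for batch in data_loader:
        model.zero_grad()
        loss = loss_fn(model, batch)   # conditional NLL  -log p(y | x)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.detach() * p.grad.detach()
        n_batches += 1
    return {n: s / n_batches for n, s in scores.items()}

def top_r_mask(score, r):
    """Boolean mask selecting the top-r fraction of entries by |score|."""
    flat = score.abs().flatten()
    k = max(1, int(r * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return score.abs() >= threshold
```

The element-wise scoring above is the simplest variant; an implementation could also aggregate scores per neuron (e.g., per row of a weight matrix) before thresholding.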

Election: To accurately select important neurons in the task vector $\tau_i$, we consider importance scores from both the base model, $I(\theta_{\mathrm{base}})$, and the fine-tuned model, $I(\theta_i)$. Our election strategy selects neurons with high scores in both models:

$$\mathcal{T}_i^{r_i} = \mathcal{N}_i^{r_i} \cap \mathcal{N}_{\mathrm{base}}^{r_i}.$$

This approach is more precise than relying on a single magnitude score.
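A minimal sketch of the election step, assuming `score_ft` and `score_base` are importance tensors for the same weight from the fine-tuned and base models (function names are ours):

```python
import torch

def elect(score_ft, score_base, r):
    """Keep coordinates that rank in the top-r fraction of BOTH models."""
    def top_mask(score):
        flat = score.abs().flatten()
        k = max(1, int(r * flat.numel()))
        thr = torch.topk(flat, k).values.min()
        return score.abs() >= thr
    # Intersection of the two top-r sets, as boolean masks.
    return top_mask(score_ft) & top_mask(score_base)
```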

Disjoint Merging: To prevent interference between important neurons from different task vectors, we use a set-difference operation to isolate them:

$$\mathrm{Disjoint}(\mathcal{T}_i^{r_i}) = \mathcal{T}_i^{r_i} \setminus \bigcup_{\substack{J \subseteq [K] \\ |J| \geq 2}} \; \bigcap_{j \in J} \mathcal{T}_j^{r_j}.$$

This ensures that $\mathrm{Disjoint}(\mathcal{T}_i^{r_i})$ contains only neurons uniquely attributed to task $i$. We then construct a mask $m_i \in \mathbb{R}^D$ to select these disjoint neurons from $\tau_i$ during merging:

$$m_{i,d} = \begin{cases} 1, & \text{if } d \in \mathrm{Disjoint}(\mathcal{T}_i^{r_i}), \\ 0, & \text{otherwise}. \end{cases}$$
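The disjoint-and-mask construction can be sketched as follows, where `elected` holds one boolean election mask per task vector (a hypothetical layout, not the paper's code). A coordinate survives only if exactly one task vector claims it:

```python
import torch

def disjoint_masks(elected):
    """Zero out coordinates elected by two or more task vectors.

    elected: list of boolean tensors (one per task vector), all same shape.
    Returns one 0/1 float mask per task with overlapping coordinates removed.
    """
    counts = torch.stack(elected).int().sum(dim=0)  # how many tasks claim each coordinate
    unique = counts == 1                            # kept only if exactly one task claims it
    return [(e & unique).float() for e in elected]
```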

Merging: The final merged task vector $\tau_m$ is then computed as:

$$\tau_m = \sum_i \lambda_i \, \tau_i \odot m_i.$$
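Putting the pieces together, the final merge can be sketched as below; `task_vectors` are the differences $\tau_i = \theta_i - \theta_{\mathrm{base}}$ and `masks` are the 0/1 disjoint masks (all names are illustrative):

```python
import torch

def merge(theta_base, task_vectors, masks, lambdas):
    """theta_merged = theta_base + sum_i lambda_i * (tau_i ⊙ m_i)."""
    tau_m = sum(lam * tau * m for lam, tau, m in zip(lambdas, task_vectors, masks))
    return theta_base + tau_m
```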

The complete workflow is summarized in Algorithm 1, which outlines the steps for calculating importance scores, electing critical neuron sets, disjointing overlapping neurons, and applying these masks to the task vectors for final model merging.

Experiments

We compare LED-Merging with Model Stock, Breadcrumbs, Task Arithmetic, and Ties-Merging across safety (HarmBench, SORRY-Bench), math reasoning (GSM8K, MATH), and code generation (MBPP, HumanEvalPack), reporting ASR, accuracy, and Pass@1.

Experiments are conducted on Llama-3-8B, Mistral-7B, and WizardLM/Llama2-13B series with safety- and utility-specialized checkpoints. We tune mask ratios $r_i$ and scaling factors $\lambda_i$ by grid search; Figure 5 shows that moderate masks ($0.3$-$0.5$) and balanced scaling provide the best safety-utility trade-off.

Hyperparameter Analysis
Safety-utility trade-offs under varying hyperparameters. (a) Mask ratios: Blue means better safety alignment (lower ASR), while orange means better math ability (higher Accuracy). The Pareto frontier (white dashed line) reveals optimal ratios (0.3–0.5) balancing both metrics. (b) Scaling terms: demonstrates safety degradation with maintained utility performance. Star markers denote configurations achieving > 90% safety preservation with < 5% utility loss.
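A simple version of this grid search might look like the following sketch; the `evaluate` callback (returning an ASR and a utility score for a given mask ratio and scaling factor) and the scalarized trade-off are our own simplifications, not the paper's tuning code:

```python
import itertools

def grid_search(evaluate, ratios=(0.3, 0.4, 0.5), lambdas=(0.5, 0.8, 1.0)):
    """Pick the (ratio, lambda) pair minimizing ASR minus utility.

    evaluate(r, lam) -> (asr, utility); lower ASR and higher
    utility both reduce the scalarized score.
    """
    best, best_cfg = None, None
    for r, lam in itertools.product(ratios, lambdas):
        asr, util = evaluate(r, lam)
        score = asr - util
        if best is None or score < best:
            best, best_cfg = score, (r, lam)
    return best_cfg
```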

Main Results

LED-Merging consistently achieves superior safety performance while preserving utility across models and tasks. For Llama3-8B, merging safety-aligned and code-specialized models reduced ASR to 14.75% on HarmBench, a 75.9% relative reduction over the standalone code model and a 31.4% relative reduction compared to the original LM model. Similarly, on Mistral-7B, merging safety and math models achieved an ASR of 16%, significantly outperforming Task Arithmetic (ASR = 55.75%) and Ties-Merging (ASR = 62%). For larger models like Llama2-13B, multi-task merging maintained an exceptionally low ASR of 4%.

| Merging Methods | LM | Math | Code | HarmBench↓ | SORRY-Bench↓ | GSM8K↑ | MATH↑ | MBPP↑ | HumanEvalPack↑ |
|---|---|---|---|---|---|---|---|---|---|
| w/o Merging | ✓ | | | 21.50 | 18.67 | 81.05 | 24.56 | 1.00 | 3.65 |
| | | ✓ | | 42.00 | 50.60 | 79.00 | 36.72 | / | / |
| | | | ✓ | 61.25 | 90.40 | / | / | 33.60 | 42.68 |
| Model Stock | ✓ | ✓ | | 36.00 | 39.55 | 59.67 | 16.64 | / | / |
| | ✓ | | ✓ | 17.25 | 12.67 | / | / | 47.00 | 39.02 |
| | ✓ | ✓ | ✓ | 23.25 | 17.78 | 52.92 | 15.22 | 47.80 | 36.59 |
| Breadcrumbs | ✓ | ✓ | | 33.00 | 35.78 | * | * | / | / |
| | ✓ | | ✓ | 39.50 | 36.89 | / | / | 53.40 | 36.58 |
| | ✓ | ✓ | ✓ | 38.25 | 40.44 | * | * | 49.40 | 36.59 |
| Task Arithmetic | ✓ | ✓ | | 26.50 | 28.89 | 54.59 | 16.77 | / | / |
| | ✓ | | ✓ | 38.00 | 31.11 | / | / | 37.80 | 18.90 |
| | ✓ | ✓ | ✓ | 32.00 | 38.44 | 13.12 | 9.92 | 21.80 | 9.15 |
| Ties-Merging | ✓ | ✓ | | 35.75 | 37.11 | 55.37 | 17.45 | / | / |
| | ✓ | | ✓ | 45.00 | 46.44 | / | / | 41.60 | 33.53 |
| | ✓ | ✓ | ✓ | 41.25 | 46.44 | 53.01 | 16.72 | 50.20 | 30.34 |
| LED-Merging (Ours) | ✓ | ✓ | | 21.00 | 11.33 | 49.89 | 16.12 | / | / |
| | ✓ | | ✓ | 14.75 | 10.22 | / | / | 47.20 | 37.80 |
| | ✓ | ✓ | ✓ | 20.75 | 10.44 | 52.39 | 15.08 | 44.60 | 36.59 |
Performance of merging Llama3-8B-Instruct (LM), MAmmoTH2-8B-Plus (Math), and Replete-Coder-Llama3-8B (Code) on all the datasets. HarmBench and SORRY-Bench measure safety alignment; GSM8K and MATH measure mathematical reasoning; MBPP and HumanEvalPack measure code generation. The best and second-best results are marked in bold and underlined fonts in the paper. *: the merged model fails to produce structured responses. /: not applicable.

Beyond safety, LED-Merging preserves strong utility: on Llama3-8B, safety+math reaches 52.39% GSM8K, safety+code reaches 47.2% MBPP Pass@1, and multi-task merging keeps balanced performance with substantially lower ASR than Task Arithmetic.

To keep this page concise, we present one representative main table above. Additional cross-family (Mistral/WizardLM) and multilingual comparisons are deferred to the paper and its appendix.

Our analysis reveals significant overlap between safety- and utility-related neurons, particularly in attention layers, suggesting a heightened risk of conflict during model merging. We calculated the Jaccard index between the top 20% safety and utility neurons across Llama3-8B-Series models, finding high values in most transformer layers. This highlights why our disjoint merging strategy is crucial.

Neuron Level Analysis
Neuron-level analysis of safety-utility overlap in each layer of Llama3-8B. Following Wei et al. (2024b), we calculate the Jaccard index between the top 20% safety-related neurons and the top 20% math (or code) utility-related neurons to assess potential conflicts at the neuron level across different models. A higher Jaccard index signifies greater overlap between safety and utility neurons. Notably, the significant overlap between safety- and utility-related neurons, particularly in the attention layers, suggests an elevated risk of conflict during model merging.
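The per-layer Jaccard index used in this analysis can be computed as in the following sketch, assuming two importance-score tensors for the same layer (names are ours):

```python
import torch

def jaccard_top_neurons(score_a, score_b, frac=0.2):
    """Jaccard index between the top-`frac` neuron sets of two importance scores."""
    def top_set(score):
        flat = score.abs().flatten()
        k = max(1, int(frac * flat.numel()))
        return set(torch.topk(flat, k).indices.tolist())
    a, b = top_set(score_a), top_set(score_b)
    return len(a & b) / len(a | b)
```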

Ablation Study

Ablation on Mistral-7B verifies that all three components matter: SNIP-based location, dual-score election (the "11" variant, which uses importance scores from both the base and fine-tuned models), and disjoint merging jointly deliver the best safety-utility balance.

| Ablation Part | Alternative Methods | HarmBench↓ | SORRY-Bench↓ | GSM8K↑ | MATH↑ |
|---|---|---|---|---|---|
| Location | Random | * | * | 25.58 | 8.66 |
| | Wanda | * | * | 39.58 | 11.37 |
| | SNIP | 16.00 | 24.22 | 50.34 | 14.20 |
| Election | 01 | 58.00 | 83.77 | 54.13 | 13.12 |
| | 10 | 35.25 | 47.33 | 50.64 | 13.30 |
| | 11 | 16.00 | 24.22 | 50.34 | 14.20 |
| Disjoint | ✗ | 63.00 | 85.33 | 72.93 | 23.18 |
| | ✓ | 16.00 | 24.22 | 50.34 | 14.20 |
Ablation study. Experiments are conducted on Mistral-7B-series models. *: the LLM's instruction-following ability is impaired.

Compared with Random and Wanda location, SNIP avoids instruction-following collapse; compared with single-score ("01"/"10") election, the dual-score ("11") variant is better balanced; and removing disjoint merging causes severe safety regression (HarmBench ASR 63.00), confirming that disjoint isolation is critical.

BibTeX

@inproceedings{ma2025led,
  title={{LED}-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint},
  author={Ma, Qianli and Liu, Dongrui and Chen, Qian and Zhang, Linfeng and Shao, Jing},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={21749--21767},
  year={2025}
}