Experimental Results
We assessed REBUTTALAGENT's efficacy by comparing it with strong closed-source LLM baselines and by ablating key system components. All experiments were conducted in a fully automated mode without human intervention, providing a conservative lower bound on performance. We used four state-of-the-art LLMs, GPT-5-mini, Grok-4.1-fast, Gemini-3-Flash, and DeepSeekV3.2, as baselines, and adopted Gemini-3-Flash as a unified LLM judge for evaluation.
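To make the evaluation protocol concrete, the sketch below shows one way a unified LLM-judge scoring pass over the nine REBUTTALBENCH dimensions could be implemented; the function `score_rebuttal`, the prompt wording, and the generic `judge` callable are illustrative assumptions rather than our exact harness.

```python
# Minimal sketch of a unified LLM-judge scoring pass (illustrative, not the exact harness).
from statistics import mean
from typing import Callable, Dict

# The nine dimensions reported in the results table, grouped as in the paper.
DIMENSIONS = [
    "Coverage", "Semantic Alignment", "Specificity",                          # Relevance
    "Logic Consistency", "Evidence Support", "Response Engagement",           # Argumentation Quality
    "Professional Tone", "Statement Clarity", "Suggestion Constructiveness",  # Communication Quality
]

def score_rebuttal(review: str, rebuttal: str, judge: Callable[[str], str]) -> Dict[str, float]:
    """Score one rebuttal on every dimension with a judge LLM (e.g., Gemini-3-Flash)."""
    scores: Dict[str, float] = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Rate the rebuttal on '{dim}' with a score from 1 to 5.\n"
            f"Review:\n{review}\n\nRebuttal:\n{rebuttal}\n\n"
            "Answer with a single number."
        )
        scores[dim] = float(judge(prompt).strip())
    scores["Average"] = mean(scores.values())  # averaged before the key is added
    return scores
```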
REBUTTALAGENT consistently outperforms strong closed-source LLMs across all evaluation dimensions on REBUTTALBENCH when sharing the same base models. The largest gains are observed in Relevance and Argumentation Quality: Coverage improves by up to +0.78 (DeepSeekV3.2), Specificity by up to +1.33 (GPT-5-mini), and Response Engagement by up to +0.63. This indicates that the performance improvements stem from task decomposition and structured intermediate reasoning, not merely stronger generative capacity. The benefit of REBUTTALAGENT is also larger for weaker base models, suggesting that explicit concern structuring, evidence construction, and response planning can partially compensate for limited base-model capability.
Metric groups: Relevance (Coverage, Semantic Alignment, Specificity); Argumentation Quality (Logic Consistency, Evidence Support, Response Engagement); Communication Quality (Professional Tone, Statement Clarity, Suggestion Constructiveness).

| Method | Coverage | Semantic Alignment | Specificity | Logic Consistency | Evidence Support | Response Engagement | Professional Tone | Statement Clarity | Suggestion Constructiveness | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeekV3.2 | 3.65 | 4.44 | 3.28 | 3.44 | 3.01 | 3.16 | 3.37 | 3.96 | 3.81 | 3.57 |
| RebuttalAgent-DeepSeekV3.2 | 4.43 (+0.78) | 4.82 (+0.38) | 4.39 (+1.11) | 3.86 (+0.42) | 3.23 (+0.22) | 3.79 (+0.63) | 3.60 (+0.23) | 4.18 (+0.22) | 4.06 (+0.25) | 4.08 (+0.51) |
| Grok-4.1-fast | 3.98 | 4.58 | 3.72 | 3.73 | 3.32 | 3.60 | 3.48 | 4.05 | 3.92 | 3.82 |
| RebuttalAgent-Grok-4.1-fast | 4.66 (+0.68) | 4.92 (+0.34) | 4.65 (+0.93) | 4.13 (+0.40) | 3.42 (+0.10) | 4.15 (+0.55) | 3.68 (+0.20) | 4.23 (+0.18) | 4.24 (+0.32) | 4.25 (+0.43) |
| Gemini-3-Flash | 4.00 | 4.71 | 3.77 | 3.71 | 3.30 | 3.56 | 3.51 | 4.08 | 3.95 | 3.85 |
| RebuttalAgent-Gemini-3-Flash | 4.51 (+0.51) | 4.88 (+0.17) | 4.49 (+0.72) | 4.11 (+0.40) | 3.39 (+0.09) | 4.07 (+0.51) | 3.78 (+0.27) | 4.28 (+0.20) | 4.09 (+0.14) | 4.23 (+0.38) |
| GPT-5-mini | 3.61 | 4.22 | 2.96 | 3.37 | 2.92 | 3.07 | 3.35 | 3.95 | 3.91 | 3.48 |
| RebuttalAgent-GPT-5-mini | 4.34 (+0.73) | 4.84 (+0.62) | 4.29 (+1.33) | 3.78 (+0.41) | 3.31 (+0.39) | 3.70 (+0.63) | 3.60 (+0.25) | 4.21 (+0.26) | 4.24 (+0.33) | 4.05 (+0.57) |
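To make the task-decomposition claim above concrete, here is a minimal sketch of a staged pipeline in the spirit described: concern structuring, evidence construction, planning and drafting, then checking. The function signature, prompts, and the `use_*` toggles (which mirror the ablation conditions reported below) are illustrative assumptions, not the system's actual interfaces.

```python
# Minimal sketch of a staged rebuttal pipeline (illustrative assumptions only).
from typing import Callable

def rebuttal_pipeline(
    review: str,
    paper: str,
    llm: Callable[[str], str],
    use_structuring: bool = True,
    use_evidence: bool = True,
    use_checker: bool = True,
) -> str:
    # 1) Input structuring: decompose the review into individual concerns.
    concerns = llm(f"List the reviewer's distinct concerns:\n{review}") if use_structuring else review
    # 2) Evidence construction: build a brief quoting paper content per concern.
    evidence = (
        llm(f"For each concern, quote supporting evidence from the paper.\nConcerns:\n{concerns}\nPaper:\n{paper}")
        if use_evidence else ""
    )
    # 3) Response planning and drafting.
    draft = llm(f"Plan and draft a point-by-point rebuttal.\nConcerns:\n{concerns}\nEvidence:\n{evidence}")
    # 4) Checker: flag unsupported claims against the evidence brief and revise.
    if use_checker:
        draft = llm(f"Check the draft for unsupported claims and revise.\nDraft:\n{draft}\nEvidence:\n{evidence}")
    return draft
```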
Our ablation studies reveal that Evidence Construction is the most critical intermediate artifact. Removing external evidence briefs leads to the largest and most consistent degradation across dimensions, with clear drops in Relevance and Communication Quality. Input Structuring and the Checker also contribute measurably, indicating that the gains of REBUTTALAGENT arise from the combination of complementary modules. Evidence-centered artifacts are the primary driver, while structuring and verification provide crucial guardrails.
| Metric | RebuttalAgent | w/o Structuring | w/o Evidence | w/o Checker |
|---|---|---|---|---|
| **Relevance** | | | | |
| Coverage | 4.51 | 4.49 (-0.02) | 4.26 (-0.25) | 4.54 (+0.03) |
| Semantic Alignment | 4.88 | 4.71 (-0.17) | 4.73 (-0.15) | 4.89 (+0.01) |
| Specificity | 4.49 | 4.46 (-0.03) | 4.19 (-0.30) | 4.47 (-0.02) |
| **Argumentation Quality** | | | | |
| Logic Consistency | 4.11 | 4.06 (-0.05) | 4.05 (-0.06) | 4.13 (+0.02) |
| Evidence Support | 3.39 | 3.23 (-0.16) | 3.32 (-0.07) | 3.39 (+0.00) |
| Response Engagement | 4.07 | 4.04 (-0.03) | 3.97 (-0.10) | 4.01 (-0.06) |
| **Communication Quality** | | | | |
| Professional Tone | 3.78 | 3.69 (-0.09) | 3.74 (-0.04) | 3.73 (-0.05) |
| Statement Clarity | 4.28 | 4.33 (+0.05) | 4.22 (-0.06) | 4.29 (+0.01) |
| Suggestion Constructiveness | 4.09 | 4.06 (-0.03) | 3.82 (-0.27) | 4.05 (-0.04) |
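Under the same assumptions as the pipeline sketch above, the ablation conditions correspond to disabling one module at a time (the variable names here are purely illustrative):

```python
# Illustrative ablation runs: toggle one module off at a time.
full        = rebuttal_pipeline(review, paper, llm)                         # RebuttalAgent
wo_struct   = rebuttal_pipeline(review, paper, llm, use_structuring=False)  # w/o Structuring
wo_evidence = rebuttal_pipeline(review, paper, llm, use_evidence=False)     # w/o Evidence
wo_checker  = rebuttal_pipeline(review, paper, llm, use_checker=False)      # w/o Checker
```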
Case studies demonstrate that REBUTTALAGENT's practice of producing an inspectable plan that separates interpretative defense from necessary intervention effectively reduces hallucination and over-commitment. When new experiments or analyses are needed, it outputs concrete deliverables and a scoped to-do list, supporting author decision-making by making the reasoning path and required work explicit before drafting.
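As one way to picture these artifacts, the sketch below gives a possible schema for the inspectable plan, with a per-concern strategy, deliverables, and a scoped to-do list; the field and class names are hypothetical and chosen only for illustration, not the system's exact output format.

```python
# Hypothetical schema for the inspectable plan produced before drafting.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConcernPlan:
    concern: str                                            # the reviewer's concern, restated
    strategy: str                                           # "interpretative_defense" or "intervention"
    evidence: List[str] = field(default_factory=list)       # quotes/results already in the paper
    deliverables: List[str] = field(default_factory=list)   # concrete items promised in the rebuttal
    todo: List[str] = field(default_factory=list)           # scoped new experiments or analyses

@dataclass
class RebuttalPlan:
    items: List[ConcernPlan]

    def requires_new_work(self) -> List[ConcernPlan]:
        """Concerns that need author intervention before the rebuttal can be finalized."""
        return [c for c in self.items if c.strategy == "intervention" or c.todo]
```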